Repository: The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge Branch: main Commit: c8a8ca17180c Files: 203 Total size: 2.8 MB Directory structure: gitextract_ufv5rlmg/ ├── .clinerules ├── .cursorrules ├── .dockerignore ├── .gitignore ├── .windsurfrules ├── Dockerfile ├── LICENSE ├── README.md ├── docs/ │ ├── AutoGen Core/ │ │ ├── 01_agent.md │ │ ├── 02_messaging_system__topic___subscription_.md │ │ ├── 03_agentruntime.md │ │ ├── 04_tool.md │ │ ├── 05_chatcompletionclient.md │ │ ├── 06_chatcompletioncontext.md │ │ ├── 07_memory.md │ │ ├── 08_component.md │ │ └── index.md │ ├── Browser Use/ │ │ ├── 01_agent.md │ │ ├── 02_system_prompt.md │ │ ├── 03_browsercontext.md │ │ ├── 04_dom_representation.md │ │ ├── 05_action_controller___registry.md │ │ ├── 06_message_manager.md │ │ ├── 07_data_structures__views_.md │ │ ├── 08_telemetry_service.md │ │ └── index.md │ ├── Celery/ │ │ ├── 01_celery_app.md │ │ ├── 02_configuration.md │ │ ├── 03_task.md │ │ ├── 04_broker_connection__amqp_.md │ │ ├── 05_worker.md │ │ ├── 06_result_backend.md │ │ ├── 07_beat__scheduler_.md │ │ ├── 08_canvas__signatures___primitives_.md │ │ ├── 09_events.md │ │ ├── 10_bootsteps.md │ │ └── index.md │ ├── Click/ │ │ ├── 01_command___group.md │ │ ├── 02_decorators.md │ │ ├── 03_parameter__option___argument_.md │ │ ├── 04_paramtype.md │ │ ├── 05_context.md │ │ ├── 06_term_ui__terminal_user_interface_.md │ │ ├── 07_click_exceptions.md │ │ └── index.md │ ├── Codex/ │ │ ├── 01_terminal_ui__ink_components_.md │ │ ├── 02_input_handling__textbuffer_editor_.md │ │ ├── 03_agent_loop.md │ │ ├── 04_approval_policy___security.md │ │ ├── 05_response___tool_call_handling.md │ │ ├── 06_command_execution___sandboxing.md │ │ ├── 07_configuration_management.md │ │ ├── 08_single_pass_mode.md │ │ └── index.md │ ├── Crawl4AI/ │ │ ├── 01_asynccrawlerstrategy.md │ │ ├── 02_asyncwebcrawler.md │ │ ├── 03_crawlerrunconfig.md │ │ ├── 04_contentscrapingstrategy.md │ │ ├── 05_relevantcontentfilter.md │ │ ├── 
06_extractionstrategy.md │ │ ├── 07_crawlresult.md │ │ ├── 08_deepcrawlstrategy.md │ │ ├── 09_cachecontext___cachemode.md │ │ ├── 10_basedispatcher.md │ │ └── index.md │ ├── CrewAI/ │ │ ├── 01_crew.md │ │ ├── 02_agent.md │ │ ├── 03_task.md │ │ ├── 04_tool.md │ │ ├── 05_process.md │ │ ├── 06_llm.md │ │ ├── 07_memory.md │ │ ├── 08_knowledge.md │ │ └── index.md │ ├── DSPy/ │ │ ├── 01_module___program.md │ │ ├── 02_signature.md │ │ ├── 03_example.md │ │ ├── 04_predict.md │ │ ├── 05_lm__language_model_client_.md │ │ ├── 06_rm__retrieval_model_client_.md │ │ ├── 07_evaluate.md │ │ ├── 08_teleprompter___optimizer.md │ │ ├── 09_adapter.md │ │ ├── 10_settings.md │ │ └── index.md │ ├── FastAPI/ │ │ ├── 01_fastapi_application___routing.md │ │ ├── 02_path_operations___parameter_declaration.md │ │ ├── 03_data_validation___serialization__pydantic_.md │ │ ├── 04_openapi___automatic_docs.md │ │ ├── 05_dependency_injection.md │ │ ├── 06_error_handling.md │ │ ├── 07_security_utilities.md │ │ ├── 08_background_tasks.md │ │ └── index.md │ ├── Flask/ │ │ ├── 01_application_object___flask__.md │ │ ├── 02_routing_system.md │ │ ├── 03_request_and_response_objects.md │ │ ├── 04_templating__jinja2_integration_.md │ │ ├── 05_context_globals___current_app____request____session____g__.md │ │ ├── 06_configuration___config__.md │ │ ├── 07_application_and_request_contexts.md │ │ ├── 08_blueprints.md │ │ └── index.md │ ├── Google A2A/ │ │ ├── 01_agent_card.md │ │ ├── 02_task.md │ │ ├── 03_a2a_protocol___core_types.md │ │ ├── 04_a2a_server_implementation.md │ │ ├── 05_a2a_client_implementation.md │ │ ├── 06_task_handling_logic__server_side_.md │ │ ├── 07_streaming_communication__sse_.md │ │ ├── 08_multi_agent_orchestration__host_agent_.md │ │ ├── 09_demo_ui_application___service.md │ │ └── index.md │ ├── LangGraph/ │ │ ├── 01_graph___stategraph.md │ │ ├── 02_nodes___pregelnode__.md │ │ ├── 03_channels.md │ │ ├── 04_control_flow_primitives___branch____send____interrupt__.md │ │ ├── 
05_pregel_execution_engine.md │ │ ├── 06_checkpointer___basecheckpointsaver__.md │ │ └── index.md │ ├── LevelDB/ │ │ ├── 01_table___sstable___tablecache.md │ │ ├── 02_memtable.md │ │ ├── 03_write_ahead_log__wal____logwriter_logreader.md │ │ ├── 04_dbimpl.md │ │ ├── 05_writebatch.md │ │ ├── 06_version___versionset.md │ │ ├── 07_iterator.md │ │ ├── 08_compaction.md │ │ ├── 09_internalkey___dbformat.md │ │ └── index.md │ ├── MCP Python SDK/ │ │ ├── 01_cli___mcp__command_.md │ │ ├── 02_fastmcp_server___fastmcp__.md │ │ ├── 03_fastmcp_resources___resource____resourcemanager__.md │ │ ├── 04_fastmcp_tools___tool____toolmanager__.md │ │ ├── 05_fastmcp_prompts___prompt____promptmanager__.md │ │ ├── 06_fastmcp_context___context__.md │ │ ├── 07_mcp_protocol_types.md │ │ ├── 08_client_server_sessions___clientsession____serversession__.md │ │ ├── 09_communication_transports__stdio__sse__websocket__memory_.md │ │ └── index.md │ ├── NumPy Core/ │ │ ├── 01_ndarray__n_dimensional_array_.md │ │ ├── 02_dtype__data_type_object_.md │ │ ├── 03_ufunc__universal_function_.md │ │ ├── 04_numeric_types___numerictypes__.md │ │ ├── 05_array_printing___arrayprint__.md │ │ ├── 06_multiarray_module.md │ │ ├── 07_umath_module.md │ │ ├── 08___array_function___protocol___overrides___overrides__.md │ │ └── index.md │ ├── OpenManus/ │ │ ├── 01_llm.md │ │ ├── 02_message___memory.md │ │ ├── 03_baseagent.md │ │ ├── 04_tool___toolcollection.md │ │ ├── 05_baseflow.md │ │ ├── 06_schema.md │ │ ├── 07_configuration__config_.md │ │ ├── 08_dockersandbox.md │ │ ├── 09_mcp__model_context_protocol_.md │ │ └── index.md │ ├── PocketFlow/ │ │ ├── 01_shared_state___shared__dictionary__.md │ │ ├── 02_node___basenode____node____asyncnode___.md │ │ ├── 03_actions___transitions_.md │ │ ├── 04_flow___flow____asyncflow___.md │ │ ├── 05_asynchronous_processing___asyncnode____asyncflow___.md │ │ ├── 06_batch_processing___batchnode____batchflow____asyncparallelbatchnode___.md │ │ ├── 
07_a2a__agent_to_agent__communication_framework_.md │ │ └── index.md │ ├── Pydantic Core/ │ │ ├── 01_basemodel.md │ │ ├── 02_fields__fieldinfo___field_function_.md │ │ ├── 03_configuration__configdict___configwrapper_.md │ │ ├── 04_custom_logic__decorators___annotated_helpers_.md │ │ ├── 05_core_schema___validation_serialization.md │ │ ├── 06_typeadapter.md │ │ └── index.md │ ├── Requests/ │ │ ├── 01_functional_api.md │ │ ├── 02_request___response_models.md │ │ ├── 03_session.md │ │ ├── 04_cookie_jar.md │ │ ├── 05_authentication_handlers.md │ │ ├── 06_exception_hierarchy.md │ │ ├── 07_transport_adapters.md │ │ ├── 08_hook_system.md │ │ └── index.md │ ├── SmolaAgents/ │ │ ├── 01_multistepagent.md │ │ ├── 02_model_interface.md │ │ ├── 03_tool.md │ │ ├── 04_agentmemory.md │ │ ├── 05_prompttemplates.md │ │ ├── 06_pythonexecutor.md │ │ ├── 07_agenttype.md │ │ ├── 08_agentlogger___monitor.md │ │ └── index.md │ ├── _config.yml │ ├── design.md │ └── index.md ├── flow.py ├── main.py ├── nodes.py ├── requirements.txt └── utils/ ├── __init__.py ├── call_llm.py ├── crawl_github_files.py └── crawl_local_files.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .clinerules
================================================
---
layout: default
title: "Agentic Coding"
---

# Agentic Coding: Humans Design, Agents code!

> If you are an AI agent involved in building LLM Systems, read this guide **VERY, VERY** carefully! This is the most important chapter in the entire document. Throughout development, you should always (1) start with a small and simple solution, (2) design at a high level (`docs/design.md`) before implementation, and (3) frequently ask humans for feedback and clarification.
{: .warning }

## Agentic Coding Steps

Agentic Coding should be a collaboration between Human System Design and Agent Implementation:

| Steps | Human | AI | Comment |
|:-----------------------|:----------:|:---------:|:------------------------------------------------------------------------|
| 1. Requirements | ★★★ High | ★☆☆ Low | Humans understand the requirements and context. |
| 2. Flow | ★★☆ Medium | ★★☆ Medium | Humans specify the high-level design, and the AI fills in the details. |
| 3. Utilities | ★★☆ Medium | ★★☆ Medium | Humans provide available external APIs and integrations, and the AI helps with implementation. |
| 4. Node | ★☆☆ Low | ★★★ High | The AI helps design the node types and data handling based on the flow. |
| 5. Implementation | ★☆☆ Low | ★★★ High | The AI implements the flow based on the design. |
| 6. Optimization | ★★☆ Medium | ★★☆ Medium | Humans evaluate the results, and the AI helps optimize. |
| 7. Reliability | ★☆☆ Low | ★★★ High | The AI writes test cases and addresses corner cases. |

1. **Requirements**: Clarify the requirements for your project, and evaluate whether an AI system is a good fit.
    - Understand AI systems' strengths and limitations:
      - **Good for**: Routine tasks requiring common sense (filling forms, replying to emails)
      - **Good for**: Creative tasks with well-defined inputs (building slides, writing SQL)
      - **Not good for**: Ambiguous problems requiring complex decision-making (business strategy, startup planning)
    - **Keep It User-Centric:** Explain the "problem" from the user's perspective rather than just listing features.
    - **Balance complexity vs. impact**: Aim to deliver the highest value features with minimal complexity early.

2. **Flow Design**: Outline, at a high level, how your AI system orchestrates nodes.
    - Identify applicable design patterns (e.g., [Map Reduce](./design_pattern/mapreduce.md), [Agent](./design_pattern/agent.md), [RAG](./design_pattern/rag.md)).
      - For each node in the flow, start with a high-level one-line description of what it does.
      - If using **Map Reduce**, specify how to map (what to split) and how to reduce (how to combine).
      - If using **Agent**, specify the inputs (context) and the possible actions.
      - If using **RAG**, specify what to embed, noting that there are usually both offline (indexing) and online (retrieval) workflows.
    - Outline the flow and draw it in a mermaid diagram. For example:
      ```mermaid
      flowchart LR
          start[Start] --> batch[Batch]
          batch --> check[Check]
          check -->|OK| process
          check -->|Error| fix[Fix]
          fix --> check

          subgraph process[Process]
            step1[Step 1] --> step2[Step 2]
          end

          process --> endNode[End]
      ```
    - > **If Humans can't specify the flow, AI Agents can't automate it!** Before building an LLM system, thoroughly understand the problem and potential solution by manually solving example inputs to develop intuition.
      {: .best-practice }

3. **Utilities**: Based on the Flow Design, identify and implement necessary utility functions.
    - Think of your AI system as the brain. It needs a body—these *external utility functions*—to interact with the real world:
        - Reading inputs (e.g., retrieving Slack messages, reading emails)
        - Writing outputs (e.g., generating reports, sending emails)
        - Using external tools (e.g., calling LLMs, searching the web)
        - **NOTE**: *LLM-based tasks* (e.g., summarizing text, analyzing sentiment) are **NOT** utility functions; rather, they are *core functions* internal to the AI system.
    - For each utility function, implement it and write a simple test.
    - Document their input/output, as well as why they are necessary. For example:
      - `name`: `get_embedding` (`utils/get_embedding.py`)
      - `input`: `str`
      - `output`: a vector of 3072 floats
      - `necessity`: Used by the second node to embed text
    - Example utility implementation:
      ```python
      # utils/call_llm.py
      from openai import OpenAI

      def call_llm(prompt):
          client = OpenAI(api_key="YOUR_API_KEY_HERE")
          r = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": prompt}]
          )
          return r.choices[0].message.content

      if __name__ == "__main__":
          prompt = "What is the meaning of life?"
          print(call_llm(prompt))
      ```
    - > **Sometimes, design Utilities before Flow:** For example, for an LLM project to automate a legacy system, the bottleneck will likely be the available interface to that system. Start by designing the hardest utilities for interfacing, and then build the flow around them.
      {: .best-practice }

4. **Node Design**: Plan how each node will read and write data, and use utility functions.
    - One core design principle for PocketFlow is to use a [shared store](./core_abstraction/communication.md), so start with a shared store design:
      - For simple systems, use an in-memory dictionary.
      - For more complex systems or when persistence is required, use a database.
      - **Don't Repeat Yourself**: Use in-memory references or foreign keys.
      - Example shared store design:
        ```python
        shared = {
            "user": {
                "id": "user123",
                "context": {                # Another nested dict
                    "weather": {"temp": 72, "condition": "sunny"},
                    "location": "San Francisco"
                }
            },
            "results": {}                   # Empty dict to store outputs
        }
        ```
    - For each [Node](./core_abstraction/node.md), describe its type, how it reads and writes data, and which utility function it uses. Keep it specific but high-level, without code. For example:
      - `type`: Regular (or Batch, or Async)
      - `prep`: Read "text" from the shared store
      - `exec`: Call the embedding utility function
      - `post`: Write "embedding" to the shared store

5. **Implementation**: Implement the initial nodes and flows based on the design.
    - 🎉 If you've reached this step, humans have finished the design. Now *Agentic Coding* begins!
    - **"Keep it simple, stupid!"** Avoid complex features and full-scale type checking.
    - **FAIL FAST**! Avoid `try` logic so you can quickly identify any weak points in the system.
    - Add logging throughout the code to facilitate debugging.

6. **Optimization**:
    - **Use Intuition**: For a quick initial evaluation, human intuition is often a good start.
    - **Redesign Flow (Back to Step 3)**: Consider breaking down tasks further, introducing agentic decisions, or better managing input contexts.
    - If your flow design is already solid, move on to micro-optimizations:
      - **Prompt Engineering**: Use clear, specific instructions with examples to reduce ambiguity.
      - **In-Context Learning**: Provide robust examples for tasks that are difficult to specify with instructions alone.

    - > **You'll likely iterate a lot!** Expect to repeat Steps 3–6 hundreds of times.
      {: .best-practice }

7. **Reliability**
    - **Node Retries**: Add checks in the node `exec` to ensure outputs meet requirements, and consider increasing `max_retries` and `wait` times.
    - **Logging and Visualization**: Maintain logs of all attempts and visualize node results for easier debugging.
    - **Self-Evaluation**: Add a separate node (powered by an LLM) to review outputs when results are uncertain.

## Example LLM Project File Structure

```
my_project/
├── main.py
├── nodes.py
├── flow.py
├── utils/
│   ├── __init__.py
│   ├── call_llm.py
│   └── search_web.py
├── requirements.txt
└── docs/
    └── design.md
```

- **`docs/design.md`**: Contains project documentation for each step above. This should be *high-level* and *no-code*.
- **`utils/`**: Contains all utility functions.
  - It's recommended to dedicate one Python file to each API call, for example `call_llm.py` or `search_web.py`.
  - Each file should also include a `main()` function to try that API call.
- **`nodes.py`**: Contains all the node definitions.
  ```python
  # nodes.py
  from pocketflow import Node
  from utils.call_llm import call_llm

  class GetQuestionNode(Node):
      def exec(self, _):
          # Get question directly from user input
          user_question = input("Enter your question: ")
          return user_question

      def post(self, shared, prep_res, exec_res):
          # Store the user's question
          shared["question"] = exec_res
          return "default"  # Go to the next node

  class AnswerNode(Node):
      def prep(self, shared):
          # Read question from shared
          return shared["question"]

      def exec(self, question):
          # Call LLM to get the answer
          return call_llm(question)

      def post(self, shared, prep_res, exec_res):
          # Store the answer in shared
          shared["answer"] = exec_res
  ```
- **`flow.py`**: Implements functions that create flows by importing node definitions and connecting them.
  ```python
  # flow.py
  from pocketflow import Flow
  from nodes import GetQuestionNode, AnswerNode

  def create_qa_flow():
      """Create and return a question-answering flow."""
      # Create nodes
      get_question_node = GetQuestionNode()
      answer_node = AnswerNode()

      # Connect nodes in sequence
      get_question_node >> answer_node

      # Create flow starting with input node
      return Flow(start=get_question_node)
  ```
- **`main.py`**: Serves as the project's entry point.
  ```python
  # main.py
  from flow import create_qa_flow

  # Example main function
  # Please replace this with your own main function
  def main():
      shared = {
          "question": None,  # Will be populated by GetQuestionNode from user input
          "answer": None     # Will be populated by AnswerNode
      }

      # Create the flow and run it
      qa_flow = create_qa_flow()
      qa_flow.run(shared)
      print(f"Question: {shared['question']}")
      print(f"Answer: {shared['answer']}")

  if __name__ == "__main__":
      main()
  ```

================================================
File: docs/index.md
================================================
---
layout: default
title: "Home"
nav_order: 1
---

# Pocket Flow

A [100-line](https://github.com/the-pocket/PocketFlow/blob/main/pocketflow/__init__.py) minimalist LLM framework for *Agents, Task Decomposition, RAG, etc*.

- **Lightweight**: Just the core graph abstraction in 100 lines. ZERO dependencies, ZERO vendor lock-in.
- **Expressive**: Everything you love from larger frameworks—([Multi-](./design_pattern/multi_agent.html))[Agents](./design_pattern/agent.html), [Workflow](./design_pattern/workflow.html), [RAG](./design_pattern/rag.html), and more.
- **Agentic-Coding**: Intuitive enough for AI agents to help humans build complex LLM applications.
## Core Abstraction

We model the LLM workflow as a **Graph + Shared Store**:

- [Node](./core_abstraction/node.md) handles simple (LLM) tasks.
- [Flow](./core_abstraction/flow.md) connects nodes through **Actions** (labeled edges).
- [Shared Store](./core_abstraction/communication.md) enables communication between nodes within flows.
- [Batch](./core_abstraction/batch.md) nodes/flows allow for data-intensive tasks.
- [Async](./core_abstraction/async.md) nodes/flows allow waiting for asynchronous tasks.
- [(Advanced) Parallel](./core_abstraction/parallel.md) nodes/flows handle I/O-bound tasks.
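
As a rough illustration of the Graph + Shared Store idea, the sketch below uses plain Python stand-ins rather than the PocketFlow classes. `MiniNode`, `run_graph`, `Load`, and `Count` are hypothetical names invented for this example; only the `prep`/`exec`/`post` step names and the notion of action-labeled edges come from the docs.

```python
# Hypothetical stand-ins for illustration; this is NOT PocketFlow's implementation.

class MiniNode:
    """A node with prep -> exec -> post steps and action-labeled successors."""

    def __init__(self):
        self.successors = {}  # action label -> next node

    def prep(self, shared):
        return None

    def exec(self, prep_res):
        return None

    def post(self, shared, prep_res, exec_res):
        return "default"

    def run(self, shared):
        prep_res = self.prep(shared)
        exec_res = self.exec(prep_res)
        # A missing return value is treated as the "default" action.
        return self.post(shared, prep_res, exec_res) or "default"


def run_graph(start, shared):
    """Follow action-labeled edges from `start` until no successor matches."""
    node = start
    while node is not None:
        action = node.run(shared)
        node = node.successors.get(action)


class Load(MiniNode):
    def post(self, shared, prep_res, exec_res):
        shared["text"] = "hello world"  # write to the shared store


class Count(MiniNode):
    def prep(self, shared):
        return shared["text"]  # read from the shared store

    def exec(self, text):
        return len(text.split())

    def post(self, shared, prep_res, exec_res):
        shared["word_count"] = exec_res


load, count = Load(), Count()
load.successors["default"] = count  # edge labeled "default"

shared = {}
run_graph(load, shared)
print(shared)  # {'text': 'hello world', 'word_count': 2}
```

The two nodes never call each other directly: all data passes through `shared`, and the only coupling between them is the labeled edge.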
## Design Pattern

From there, it’s easy to implement popular design patterns:

- [Agent](./design_pattern/agent.md) autonomously makes decisions.
- [Workflow](./design_pattern/workflow.md) chains multiple tasks into pipelines.
- [RAG](./design_pattern/rag.md) integrates data retrieval with generation.
- [Map Reduce](./design_pattern/mapreduce.md) splits data tasks into Map and Reduce steps.
- [Structured Output](./design_pattern/structure.md) formats outputs consistently.
- [(Advanced) Multi-Agents](./design_pattern/multi_agent.md) coordinate multiple agents.
## Utility Function

We **do not** provide built-in utilities. Instead, we offer *examples*—please *implement your own*:

- [LLM Wrapper](./utility_function/llm.md)
- [Viz and Debug](./utility_function/viz.md)
- [Web Search](./utility_function/websearch.md)
- [Chunking](./utility_function/chunking.md)
- [Embedding](./utility_function/embedding.md)
- [Vector Databases](./utility_function/vector.md)
- [Text-to-Speech](./utility_function/text_to_speech.md)

**Why not built-in?**: I believe it's *bad practice* to build vendor-specific APIs into a general framework:

- *API Volatility*: Frequent changes lead to heavy maintenance for hardcoded APIs.
- *Flexibility*: You may want to switch vendors, use fine-tuned models, or run them locally.
- *Optimizations*: Prompt caching, batching, and streaming are easier without vendor lock-in.

## Ready to build your Apps?

Check out [Agentic Coding Guidance](./guide.md), the fastest way to develop LLM projects with Pocket Flow!

================================================
File: docs/core_abstraction/async.md
================================================
---
layout: default
title: "(Advanced) Async"
parent: "Core Abstraction"
nav_order: 5
---

# (Advanced) Async

**Async** Nodes implement `prep_async()`, `exec_async()`, `exec_fallback_async()`, and/or `post_async()`. This is useful for:

1. **prep_async()**: For *fetching/reading data (files, APIs, DB)* in an I/O-friendly way.
2. **exec_async()**: Typically used for async LLM calls.
3. **post_async()**: For *awaiting user feedback*, *coordinating across multiple agents*, or any additional async steps after `exec_async()`.

**Note**: `AsyncNode` must be wrapped in `AsyncFlow`. `AsyncFlow` can also include regular (sync) nodes.
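
To make the ordering of the three async steps concrete, here is a minimal sketch built only on the standard library's `asyncio`. `MiniAsyncNode` and `demo` are hypothetical names, and the `asyncio.sleep(0)` calls are placeholders for real I/O; this is not how PocketFlow's `AsyncNode` is implemented.

```python
import asyncio

# Hypothetical stand-in for illustration; this is NOT PocketFlow's AsyncNode.
class MiniAsyncNode:
    async def prep_async(self, shared):
        await asyncio.sleep(0)  # placeholder for async I/O (file, API, DB)
        return shared["doc"]

    async def exec_async(self, prep_res):
        await asyncio.sleep(0)  # placeholder for an async LLM call
        return prep_res.upper()

    async def post_async(self, shared, prep_res, exec_res):
        await asyncio.sleep(0)  # placeholder for awaiting user feedback
        shared["result"] = exec_res
        return "default"

    async def run_async(self, shared):
        # Each step is awaited in turn; the event loop is free between steps.
        prep_res = await self.prep_async(shared)
        exec_res = await self.exec_async(prep_res)
        return await self.post_async(shared, prep_res, exec_res)


async def demo():
    shared = {"doc": "draft"}
    action = await MiniAsyncNode().run_async(shared)
    return shared, action


shared, action = asyncio.run(demo())
print(action, shared["result"])  # default DRAFT
```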
### Example

```python
import asyncio

class SummarizeThenVerify(AsyncNode):
    async def prep_async(self, shared):
        # Example: read a file asynchronously
        doc_text = await read_file_async(shared["doc_path"])
        return doc_text

    async def exec_async(self, prep_res):
        # Example: async LLM call
        summary = await call_llm_async(f"Summarize: {prep_res}")
        return summary

    async def post_async(self, shared, prep_res, exec_res):
        # Example: wait for user feedback
        decision = await gather_user_feedback(exec_res)
        if decision == "approve":
            shared["summary"] = exec_res
            return "approve"
        return "deny"

summarize_node = SummarizeThenVerify()
final_node = Finalize()

# Define transitions
summarize_node - "approve" >> final_node
summarize_node - "deny" >> summarize_node  # retry

flow = AsyncFlow(start=summarize_node)

async def main():
    shared = {"doc_path": "document.txt"}
    await flow.run_async(shared)
    print("Final Summary:", shared.get("summary"))

asyncio.run(main())
```

================================================
File: docs/core_abstraction/batch.md
================================================
---
layout: default
title: "Batch"
parent: "Core Abstraction"
nav_order: 4
---

# Batch

**Batch** makes it easier to handle large inputs in one Node or **rerun** a Flow multiple times. Example use cases:

- **Chunk-based** processing (e.g., splitting large texts).
- **Iterative** processing over lists of input items (e.g., user queries, files, URLs).

## 1. BatchNode

A **BatchNode** extends `Node` but changes `prep()` and `exec()`:

- **`prep(shared)`**: returns an **iterable** (e.g., list, generator).
- **`exec(item)`**: called **once** per item in that iterable.
- **`post(shared, prep_res, exec_res_list)`**: after all items are processed, receives a **list** of results (`exec_res_list`) and returns an **Action**.
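
The contract above can be sketched in plain Python. `MiniBatchNode` and `WordLengths` are hypothetical names, and the `run` method is an assumption about how such a loop could look, not PocketFlow's actual `BatchNode` code.

```python
# Hypothetical stand-in for illustration; this is NOT PocketFlow's BatchNode.
class MiniBatchNode:
    def prep(self, shared):
        return []

    def exec(self, item):
        return item

    def post(self, shared, prep_res, exec_res_list):
        return "default"

    def run(self, shared):
        items = self.prep(shared)                            # an iterable
        exec_res_list = [self.exec(item) for item in items]  # one exec() call per item
        return self.post(shared, items, exec_res_list)       # post() sees all results


class WordLengths(MiniBatchNode):
    def prep(self, shared):
        return shared["words"]

    def exec(self, word):
        return len(word)

    def post(self, shared, prep_res, exec_res_list):
        shared["lengths"] = exec_res_list
        return "default"


shared = {"words": ["batch", "node", "demo"]}
action = WordLengths().run(shared)
print(action, shared["lengths"])  # default [5, 4, 4]
```

Note that `exec` only ever sees a single item, so per-item logic stays the same as in a regular node.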
### Example: Summarize a Large File

```python
class MapSummaries(BatchNode):
    def prep(self, shared):
        # Suppose we have a big file; chunk it
        content = shared["data"]
        chunk_size = 10000
        chunks = [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)]
        return chunks

    def exec(self, chunk):
        prompt = f"Summarize this chunk in 10 words: {chunk}"
        summary = call_llm(prompt)
        return summary

    def post(self, shared, prep_res, exec_res_list):
        combined = "\n".join(exec_res_list)
        shared["summary"] = combined
        return "default"

map_summaries = MapSummaries()
flow = Flow(start=map_summaries)
flow.run(shared)
```

---

## 2. BatchFlow

A **BatchFlow** runs a **Flow** multiple times, each time with different `params`. Think of it as a loop that replays the Flow for each parameter set.

### Example: Summarize Many Files

```python
class SummarizeAllFiles(BatchFlow):
    def prep(self, shared):
        # Return a list of param dicts (one per file)
        filenames = list(shared["data"].keys())  # e.g., ["file1.txt", "file2.txt", ...]
        return [{"filename": fn} for fn in filenames]

# Suppose we have a per-file Flow (e.g., load_file >> summarize >> reduce):
summarize_file = SummarizeFile(start=load_file)

# Wrap that flow into a BatchFlow:
summarize_all_files = SummarizeAllFiles(start=summarize_file)
summarize_all_files.run(shared)
```

### Under the Hood

1. `prep(shared)` returns a list of param dicts—e.g., `[{"filename": "file1.txt"}, {"filename": "file2.txt"}, ...]`.
2. The **BatchFlow** loops through each dict. For each one:
   - It merges the dict with the BatchFlow’s own `params`.
   - It calls `flow.run(shared)` using the merged result.
3. This means the sub-Flow is run **repeatedly**, once for every param dict.

---

## 3. Nested or Multi-Level Batches

You can nest a **BatchFlow** in another **BatchFlow**. For instance:

- **Outer** batch: returns a list of directory param dicts (e.g., `{"directory": "/pathA"}`, `{"directory": "/pathB"}`, ...).
- **Inner** batch: returns a list of per-file param dicts.
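
The param merging across two batch levels can be sketched with plain dicts. `merge_params` is a hypothetical helper written for this illustration, and the rule that the inner dict wins on a key conflict is an assumption of the sketch rather than something stated in the docs.

```python
# Hypothetical helper for illustration; the names here are not part of PocketFlow.
def merge_params(parent_params, own_params):
    """Return a new dict; keys in `own_params` win on conflict (an assumption)."""
    return {**parent_params, **own_params}


outer_batch = [{"directory": "/pathA"}, {"directory": "/pathB"}]
inner_batch = [{"filename": "file1.txt"}, {"filename": "file2.txt"}]

seen = []
for dir_params in outer_batch:       # outer level: one run per directory
    for file_params in inner_batch:  # inner level: one run per file
        seen.append(merge_params(dir_params, file_params))

print(seen[0])    # {'directory': '/pathA', 'filename': 'file1.txt'}
print(len(seen))  # 4
```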
At each level, **BatchFlow** merges its own param dict with the parent’s. By the time you reach the **innermost** node, the final `params` is the merged result of **all** parents in the chain. This way, a nested structure can keep track of the entire context (e.g., directory + file name) at once.

```python
import os

class FileBatchFlow(BatchFlow):
    def prep(self, shared):
        directory = self.params["directory"]
        # e.g., files = ["file1.txt", "file2.txt", ...]
        files = [f for f in os.listdir(directory) if f.endswith(".txt")]
        return [{"filename": f} for f in files]

class DirectoryBatchFlow(BatchFlow):
    def prep(self, shared):
        directories = ["/path/to/dirA", "/path/to/dirB"]
        return [{"directory": d} for d in directories]

# MapSummaries receives params like {"directory": "/path/to/dirA", "filename": "file1.txt"}
inner_flow = FileBatchFlow(start=MapSummaries())
outer_flow = DirectoryBatchFlow(start=inner_flow)
```

================================================
File: docs/core_abstraction/communication.md
================================================
---
layout: default
title: "Communication"
parent: "Core Abstraction"
nav_order: 3
---

# Communication

Nodes and Flows **communicate** in 2 ways:

1. **Shared Store (for almost all the cases)**

   - A global data structure (often an in-mem dict) that all nodes can read (`prep()`) and write (`post()`).
   - Great for data results, large content, or anything multiple nodes need.
   - You should design the data structure and populate it ahead of time.

   - > **Separation of Concerns:** Use `Shared Store` for almost all cases to separate *Data Schema* from *Compute Logic*! This approach is both flexible and easy to manage, resulting in more maintainable code. `Params` is mostly syntactic sugar for [Batch](./batch.md).
     {: .best-practice }

2. **Params (only for [Batch](./batch.md))**
   - Each node has a local, ephemeral `params` dict passed in by the **parent Flow**, used as an identifier for tasks. Parameter keys and values shall be **immutable**.
   - Good for identifiers like filenames or numeric IDs, in Batch mode.

If you know memory management, think of the **Shared Store** like a **heap** (shared by all function calls), and **Params** like a **stack** (assigned by the caller).

---

## 1. Shared Store

### Overview

A shared store is typically an in-mem dictionary, like:

```python
shared = {"data": {}, "summary": {}, "config": {...}, ...}
```

It can also contain local file handlers, DB connections, or a combination for persistence. We recommend deciding the data structure or DB schema first based on your app requirements.

### Example

```python
class LoadData(Node):
    def post(self, shared, prep_res, exec_res):
        # We write data to shared store
        shared["data"] = "Some text content"
        return None

class Summarize(Node):
    def prep(self, shared):
        # We read data from shared store
        return shared["data"]

    def exec(self, prep_res):
        # Call LLM to summarize
        prompt = f"Summarize: {prep_res}"
        summary = call_llm(prompt)
        return summary

    def post(self, shared, prep_res, exec_res):
        # We write summary to shared store
        shared["summary"] = exec_res
        return "default"

load_data = LoadData()
summarize = Summarize()
load_data >> summarize
flow = Flow(start=load_data)

shared = {}
flow.run(shared)
```

Here:
- `LoadData` writes to `shared["data"]`.
- `Summarize` reads from `shared["data"]`, summarizes, and writes to `shared["summary"]`.

---

## 2. Params

**Params** let you store *per-Node* or *per-Flow* config that doesn't need to live in the shared store. They are:
- **Immutable** during a Node's run cycle (i.e., they don't change mid-`prep->exec->post`).
- **Set** via `set_params()`.
- **Cleared** and updated each time a parent Flow calls it.

> Only set the uppermost Flow params because others will be overwritten by the parent Flow.
>
> If you need to set child node params, see [Batch](./batch.md).
{: .warning }

Typically, **Params** are identifiers (e.g., file name, page number).
Use them to fetch the task you assigned or write to a specific part of the shared store.

### Example

```python
# 1) Create a Node that uses params
class SummarizeFile(Node):
    def prep(self, shared):
        # Access the node's param
        filename = self.params["filename"]
        return shared["data"].get(filename, "")

    def exec(self, prep_res):
        prompt = f"Summarize: {prep_res}"
        return call_llm(prompt)

    def post(self, shared, prep_res, exec_res):
        filename = self.params["filename"]
        shared["summary"][filename] = exec_res
        return "default"

# 2) Set params
node = SummarizeFile()

# 3) Set Node params directly (for testing)
node.set_params({"filename": "doc1.txt"})
node.run(shared)

# 4) Create Flow
flow = Flow(start=node)

# 5) Set Flow params (overwrites node params)
flow.set_params({"filename": "doc2.txt"})
flow.run(shared)  # The node summarizes doc2, not doc1
```

================================================
File: docs/core_abstraction/flow.md
================================================
---
layout: default
title: "Flow"
parent: "Core Abstraction"
nav_order: 2
---

# Flow

A **Flow** orchestrates a graph of Nodes. You can chain Nodes in a sequence or create branching depending on the **Actions** returned from each Node's `post()`.

## 1. Action-based Transitions

Each Node's `post()` returns an **Action** string. By default, if `post()` doesn't return anything, we treat that as `"default"`.

You define transitions with the syntax:

1. **Basic default transition**: `node_a >> node_b`
  This means if `node_a.post()` returns `"default"`, go to `node_b`.
  (Equivalent to `node_a - "default" >> node_b`)

2. **Named action transition**: `node_a - "action_name" >> node_b`
  This means if `node_a.post()` returns `"action_name"`, go to `node_b`.

It's possible to create loops, branching, or multi-step flows.

## 2. Creating a Flow

A **Flow** begins with a **start** node. You call `Flow(start=some_node)` to specify the entry point.
When you call `flow.run(shared)`, it executes the start node, looks at its returned Action from `post()`, follows the transition, and continues until there's no next node.

### Example: Simple Sequence

Here's a minimal flow of two nodes in a chain:

```python
node_a >> node_b
flow = Flow(start=node_a)
flow.run(shared)
```

- When you run the flow, it executes `node_a`.
- Suppose `node_a.post()` returns `"default"`.
- The flow then sees that the `"default"` Action is linked to `node_b` and runs `node_b`.
- `node_b.post()` returns `"default"` but we didn't define `node_b >> something_else`. So the flow ends there.

### Example: Branching & Looping

Here's a simple expense approval flow that demonstrates branching and looping. The `ReviewExpense` node can return three possible Actions:

- `"approved"`: expense is approved, move to payment processing
- `"needs_revision"`: expense needs changes, send back for revision
- `"rejected"`: expense is denied, finish the process

We can wire them like this:

```python
# Define the flow connections
review - "approved" >> payment        # If approved, process payment
review - "needs_revision" >> revise   # If needs changes, go to revision
review - "rejected" >> finish         # If rejected, finish the process

revise >> review   # After revision, go back for another review
payment >> finish  # After payment, finish the process

flow = Flow(start=review)
```

Let's see how it flows:

1. If `review.post()` returns `"approved"`, the expense moves to the `payment` node
2. If `review.post()` returns `"needs_revision"`, it goes to the `revise` node, which then loops back to `review`
3. If `review.post()` returns `"rejected"`, it moves to the `finish` node and stops

```mermaid
flowchart TD
    review[Review Expense] -->|approved| payment[Process Payment]
    review -->|needs_revision| revise[Revise Report]
    review -->|rejected| finish[Finish Process]

    revise --> review
    payment --> finish
```

### Running Individual Nodes vs. Running a Flow

- `node.run(shared)`: Just runs that node alone (calls `prep->exec->post()`), returns an Action.
- `flow.run(shared)`: Executes from the start node, follows Actions to the next node, and so on until the flow can't continue.

> `node.run(shared)` **does not** proceed to the successor.
> This is mainly for debugging or testing a single node.
>
> Always use `flow.run(...)` in production to ensure the full pipeline runs correctly.
{: .warning }

## 3. Nested Flows

A **Flow** can act like a Node, which enables powerful composition patterns. This means you can:

1. Use a Flow as a Node within another Flow's transitions.
2. Combine multiple smaller Flows into a larger Flow for reuse.
3. Rely on merged `params`: a Node's `params` is the merged result of **all** its parents' `params`.

### Flow's Node Methods

A **Flow** is also a **Node**, so it will run `prep()` and `post()`. However:

- It **won't** run `exec()`, as its main logic is to orchestrate its nodes.
- `post()` always receives `None` for `exec_res` and should instead get the flow execution results from the shared store.

### Basic Flow Nesting

Here's how to connect a flow to another node:

```python
# Create a sub-flow
node_a >> node_b
subflow = Flow(start=node_a)

# Connect it to another node
subflow >> node_c

# Create the parent flow
parent_flow = Flow(start=subflow)
```

When `parent_flow.run()` executes:
1. It starts `subflow`
2. `subflow` runs through its nodes (`node_a->node_b`)
3. After `subflow` completes, execution continues to `node_c`

### Example: Order Processing Pipeline

Here's a practical example that breaks down order processing into nested flows:

```python
# Payment processing sub-flow
validate_payment >> process_payment >> payment_confirmation
payment_flow = Flow(start=validate_payment)

# Inventory sub-flow
check_stock >> reserve_items >> update_inventory
inventory_flow = Flow(start=check_stock)

# Shipping sub-flow
create_label >> assign_carrier >> schedule_pickup
shipping_flow = Flow(start=create_label)

# Connect the flows into a main order pipeline
payment_flow >> inventory_flow >> shipping_flow

# Create the master flow
order_pipeline = Flow(start=payment_flow)

# Run the entire pipeline
order_pipeline.run(shared_data)
```

This creates a clean separation of concerns while maintaining a clear execution path:

```mermaid
flowchart LR
    subgraph order_pipeline[Order Pipeline]
        subgraph paymentFlow["Payment Flow"]
            A[Validate Payment] --> B[Process Payment] --> C[Payment Confirmation]
        end

        subgraph inventoryFlow["Inventory Flow"]
            D[Check Stock] --> E[Reserve Items] --> F[Update Inventory]
        end

        subgraph shippingFlow["Shipping Flow"]
            G[Create Label] --> H[Assign Carrier] --> I[Schedule Pickup]
        end

        paymentFlow --> inventoryFlow
        inventoryFlow --> shippingFlow
    end
```

================================================
File: docs/core_abstraction/node.md
================================================
---
layout: default
title: "Node"
parent: "Core Abstraction"
nav_order: 1
---

# Node

A **Node** is the smallest building block. Each Node has 3 steps `prep->exec->post`:
1. `prep(shared)`
   - **Read and preprocess data** from `shared` store.
   - Examples: *query DB, read files, or serialize data into a string*.
   - Return `prep_res`, which is used by `exec()` and `post()`.

2. `exec(prep_res)`
   - **Execute compute logic**, with optional retries and error handling (below).
   - Examples: *(mostly) LLM calls, remote APIs, tool use*.
   - ⚠️ This step is for compute only and must **NOT** access `shared`.
   - ⚠️ If retries are enabled, ensure the implementation is idempotent.
   - Return `exec_res`, which is passed to `post()`.

3. `post(shared, prep_res, exec_res)`
   - **Postprocess and write data** back to `shared`.
   - Examples: *update DB, change states, log results*.
   - **Decide the next action** by returning a *string* (`action = "default"` if *None*).

> **Why 3 steps?** To enforce the principle of *separation of concerns*. Data storage and data processing are handled separately.
>
> All steps are *optional*. E.g., you can implement only `prep` and `post` if you just need to process data.
{: .note }

### Fault Tolerance & Retries

You can **retry** `exec()` if it raises an exception via two parameters when defining the Node:

- `max_retries` (int): Max times to run `exec()`. The default is `1` (**no** retry).
- `wait` (int): The time to wait (in **seconds**) before the next retry. By default, `wait=0` (no waiting).

`wait` is helpful when you encounter rate limits or quota errors from your LLM provider and need to back off.

```python
my_node = SummarizeFile(max_retries=3, wait=10)
```

When an exception occurs in `exec()`, the Node automatically retries until:

- It either succeeds, or
- The Node has retried `max_retries - 1` times already and fails on the last attempt.

You can get the current retry count (0-based) from `self.cur_retry`.
```python class RetryNode(Node): def exec(self, prep_res): print(f"Retry {self.cur_retry} times") raise Exception("Failed") ``` ### Graceful Fallback To **gracefully handle** the exception (after all retries) rather than raising it, override: ```python def exec_fallback(self, prep_res, exc): raise exc ``` By default, it just re-raises exception. But you can return a fallback result instead, which becomes the `exec_res` passed to `post()`. ### Example: Summarize file ```python class SummarizeFile(Node): def prep(self, shared): return shared["data"] def exec(self, prep_res): if not prep_res: return "Empty file content" prompt = f"Summarize this text in 10 words: {prep_res}" summary = call_llm(prompt) # might fail return summary def exec_fallback(self, prep_res, exc): # Provide a simple fallback instead of crashing return "There was an error processing your request." def post(self, shared, prep_res, exec_res): shared["summary"] = exec_res # Return "default" by not returning summarize_node = SummarizeFile(max_retries=3) # node.run() calls prep->exec->post # If exec() fails, it retries up to 3 times before calling exec_fallback() action_result = summarize_node.run(shared) print("Action returned:", action_result) # "default" print("Summary stored:", shared["summary"]) ``` ================================================ File: docs/core_abstraction/parallel.md ================================================ --- layout: default title: "(Advanced) Parallel" parent: "Core Abstraction" nav_order: 6 --- # (Advanced) Parallel **Parallel** Nodes and Flows let you run multiple **Async** Nodes and Flows **concurrently**—for example, summarizing multiple texts at once. This can improve performance by overlapping I/O and compute. > Because of Python’s GIL, parallel nodes and flows can’t truly parallelize CPU-bound tasks (e.g., heavy numerical computations). However, they excel at overlapping I/O-bound work—like LLM calls, database queries, API requests, or file I/O. 
{: .warning } > - **Ensure Tasks Are Independent**: If each item depends on the output of a previous item, **do not** parallelize. > > - **Beware of Rate Limits**: Parallel calls can **quickly** trigger rate limits on LLM services. You may need a **throttling** mechanism (e.g., semaphores or sleep intervals). > > - **Consider Single-Node Batch APIs**: Some LLMs offer a **batch inference** API where you can send multiple prompts in a single call. This is more complex to implement but can be more efficient than launching many parallel requests and mitigates rate limits. {: .best-practice } ## AsyncParallelBatchNode Like **AsyncBatchNode**, but run `exec_async()` in **parallel**: ```python class ParallelSummaries(AsyncParallelBatchNode): async def prep_async(self, shared): # e.g., multiple texts return shared["texts"] async def exec_async(self, text): prompt = f"Summarize: {text}" return await call_llm_async(prompt) async def post_async(self, shared, prep_res, exec_res_list): shared["summary"] = "\n\n".join(exec_res_list) return "default" node = ParallelSummaries() flow = AsyncFlow(start=node) ``` ## AsyncParallelBatchFlow Parallel version of **BatchFlow**. Each iteration of the sub-flow runs **concurrently** using different parameters: ```python class SummarizeMultipleFiles(AsyncParallelBatchFlow): async def prep_async(self, shared): return [{"filename": f} for f in shared["files"]] sub_flow = AsyncFlow(start=LoadAndSummarizeFile()) parallel_flow = SummarizeMultipleFiles(start=sub_flow) await parallel_flow.run_async(shared) ``` ================================================ File: docs/design_pattern/agent.md ================================================ --- layout: default title: "Agent" parent: "Design Pattern" nav_order: 1 --- # Agent Agent is a powerful design pattern in which nodes can take dynamic actions based on the context.
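The idea of "dynamic actions based on the context" can be sketched without any framework. The snippet below is a minimal illustration, not PocketFlow's implementation: `decide`, `search`, `answer`, and `run_agent` are hypothetical names, and `decide` is a hard-coded stand-in for what would normally be an LLM call.

```python
# Hypothetical stand-in for an LLM-backed decision step.
def decide(state):
    # Keep searching until two results are collected, then answer.
    return "search" if len(state["context"]) < 2 else "answer"

def search(state):
    # Pretend to search and record the result in the shared state.
    state["context"].append(f"result {len(state['context']) + 1}")
    return "decide"  # loop back to the decision step

def answer(state):
    state["answer"] = f"Answer based on {len(state['context'])} results"
    return None  # no next action, so the loop ends

def run_agent(state, max_steps=10):
    # Map each action string to the handler that performs it.
    handlers = {"search": search, "answer": answer}
    trace = []
    for _ in range(max_steps):
        action = decide(state)
        trace.append(action)
        if handlers[action](state) is None:
            break
    return trace

state = {"context": []}
trace = run_agent(state)
print(trace, state["answer"])
```

The `max_steps` cap is a design choice in this sketch: it keeps a misbehaving decision step from looping forever.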
## Implement Agent with Graph 1. **Context and Action:** Implement nodes that supply context and perform actions. 2. **Branching:** Use branching to connect each action node to an agent node. Use action to allow the agent to direct the [flow](../core_abstraction/flow.md) between nodes—and potentially loop back for multi-step. 3. **Agent Node:** Provide a prompt to decide action—for example: ```python f""" ### CONTEXT Task: {task_description} Previous Actions: {previous_actions} Current State: {current_state} ### ACTION SPACE [1] search Description: Use web search to get results Parameters: - query (str): What to search for [2] answer Description: Conclude based on the results Parameters: - result (str): Final answer to provide ### NEXT ACTION Decide the next action based on the current context and available action space. Return your response in the following format: ```yaml thinking: | action: parameters: : ```""" ``` The core of building **high-performance** and **reliable** agents boils down to: 1. **Context Management:** Provide *relevant, minimal context.* For example, rather than including an entire chat history, retrieve the most relevant via [RAG](./rag.md). Even with larger context windows, LLMs still fall victim to ["lost in the middle"](https://arxiv.org/abs/2307.03172), overlooking mid-prompt content. 2. **Action Space:** Provide *a well-structured and unambiguous* set of actions—avoiding overlap like separate `read_databases` or `read_csvs`. Instead, import CSVs into the database. ## Example Good Action Design - **Incremental:** Feed content in manageable chunks (500 lines or 1 page) instead of all at once. - **Overview-zoom-in:** First provide high-level structure (table of contents, summary), then allow drilling into details (raw texts). - **Parameterized/Programmable:** Instead of fixed actions, enable parameterized (columns to select) or programmable (SQL queries) actions, for example, to read CSV files. 
- **Backtracking:** Let the agent undo the last step instead of restarting entirely, preserving progress when encountering errors or dead ends.

## Example: Search Agent

This agent:
1. Decides whether to search or answer
2. If it searches, loops back to decide whether more searching is needed
3. Answers when enough context has been gathered

```python
import yaml  # used by yaml.safe_load below; Node, Flow, call_llm and search_web are assumed to be defined elsewhere

class DecideAction(Node):
    def prep(self, shared):
        context = shared.get("context", "No previous search")
        query = shared["query"]
        return query, context

    def exec(self, inputs):
        query, context = inputs
        prompt = f"""
Given input: {query}
Previous search results: {context}
Should I: 1) Search web for more info 2) Answer with current knowledge
Output in yaml:
```yaml
action: search/answer
reason: why this action
search_term: search phrase if action is search
```"""
        resp = call_llm(prompt)
        # Extract the YAML block from the response and parse it
        yaml_str = resp.split("```yaml")[1].split("```")[0].strip()
        result = yaml.safe_load(yaml_str)

        # Validate the structure; a failed assert triggers a Node retry
        assert isinstance(result, dict)
        assert "action" in result
        assert "reason" in result
        assert result["action"] in ["search", "answer"]
        if result["action"] == "search":
            assert "search_term" in result

        return result

    def post(self, shared, prep_res, exec_res):
        if exec_res["action"] == "search":
            shared["search_term"] = exec_res["search_term"]
        return exec_res["action"]

class SearchWeb(Node):
    def prep(self, shared):
        return shared["search_term"]

    def exec(self, search_term):
        return search_web(search_term)

    def post(self, shared, prep_res, exec_res):
        prev_searches = shared.get("context", [])
        shared["context"] = prev_searches + [
            {"term": shared["search_term"], "result": exec_res}
        ]
        return "decide"

class DirectAnswer(Node):
    def prep(self, shared):
        return shared["query"], shared.get("context", "")

    def exec(self, inputs):
        query, context = inputs
        return call_llm(f"Context: {context}\nAnswer: {query}")

    def post(self, shared, prep_res, exec_res):
        print(f"Answer: {exec_res}")
        shared["answer"] = exec_res

# Connect nodes
decide = DecideAction()
search = SearchWeb()
answer = DirectAnswer()

decide - "search" >> search
decide - "answer" >> answer
search - "decide" >> decide  # Loop back

flow = Flow(start=decide)
flow.run({"query": "Who won the Nobel Prize in Physics 2024?"})
```

================================================
File: docs/design_pattern/mapreduce.md
================================================
---
layout: default
title: "Map Reduce"
parent: "Design Pattern"
nav_order: 4
---

# Map Reduce

MapReduce is a design pattern suitable when you have either:
- Large input data (e.g., multiple files to process), or
- Large output data (e.g., multiple forms to fill)

and there is a logical way to break the task into smaller, ideally independent parts.
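The split-then-combine idea above can be sketched in plain Python before bringing in any framework. This is only an illustration under stated assumptions: `split_into_parts`, `summarize_part`, and `combine` are hypothetical helpers, and `summarize_part` counts words where a real system would call an LLM.

```python
def split_into_parts(files):
    # Map input: one independent (name, text) pair per file.
    return list(files.items())

def summarize_part(part):
    # Hypothetical stand-in for an LLM call: report the word count.
    name, text = part
    return name, f"{len(text.split())} words"

def combine(partials):
    # Reduce: merge the per-file results into a single string.
    return "; ".join(f"{name}: {summary}" for name, summary in partials)

files = {"a.txt": "one two three", "b.txt": "four five"}
partials = [summarize_part(part) for part in split_into_parts(files)]
final = combine(partials)
print(final)
```

Because each part is processed independently, the map step is the natural place to batch or parallelize work later.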
You first break down the task using [BatchNode](../core_abstraction/batch.md) in the map phase, followed by aggregation in the reduce phase. ### Example: Document Summarization ```python class SummarizeAllFiles(BatchNode): def prep(self, shared): files_dict = shared["files"] # e.g. 10 files return list(files_dict.items()) # [("file1.txt", "aaa..."), ("file2.txt", "bbb..."), ...] def exec(self, one_file): filename, file_content = one_file summary_text = call_llm(f"Summarize the following file:\n{file_content}") return (filename, summary_text) def post(self, shared, prep_res, exec_res_list): shared["file_summaries"] = dict(exec_res_list) class CombineSummaries(Node): def prep(self, shared): return shared["file_summaries"] def exec(self, file_summaries): # format as: "File1: summary\nFile2: summary...\n" text_list = [] for fname, summ in file_summaries.items(): text_list.append(f"{fname} summary:\n{summ}\n") big_text = "\n---\n".join(text_list) return call_llm(f"Combine these file summaries into one final summary:\n{big_text}") def post(self, shared, prep_res, final_summary): shared["all_files_summary"] = final_summary batch_node = SummarizeAllFiles() combine_node = CombineSummaries() batch_node >> combine_node flow = Flow(start=batch_node) shared = { "files": { "file1.txt": "Alice was beginning to get very tired of sitting by her sister...", "file2.txt": "Some other interesting text ...", # ... } } flow.run(shared) print("Individual Summaries:", shared["file_summaries"]) print("\nFinal Summary:\n", shared["all_files_summary"]) ``` ================================================ File: docs/design_pattern/rag.md ================================================ --- layout: default title: "RAG" parent: "Design Pattern" nav_order: 3 --- # RAG (Retrieval Augmented Generation) For certain LLM tasks like answering questions, providing relevant context is essential. One common architecture is a **two-stage** RAG pipeline:
1. **Offline stage**: Preprocess and index documents ("building the index"). 2. **Online stage**: Given a question, generate answers by retrieving the most relevant context. --- ## Stage 1: Offline Indexing We create three Nodes: 1. `ChunkDocs` – [chunks](../utility_function/chunking.md) raw text. 2. `EmbedDocs` – [embeds](../utility_function/embedding.md) each chunk. 3. `StoreIndex` – stores embeddings into a [vector database](../utility_function/vector.md). ```python class ChunkDocs(BatchNode): def prep(self, shared): # A list of file paths in shared["files"]. We process each file. return shared["files"] def exec(self, filepath): # read file content. In real usage, do error handling. with open(filepath, "r", encoding="utf-8") as f: text = f.read() # chunk by 100 chars each chunks = [] size = 100 for i in range(0, len(text), size): chunks.append(text[i : i + size]) return chunks def post(self, shared, prep_res, exec_res_list): # exec_res_list is a list of chunk-lists, one per file. # flatten them all into a single list of chunks. all_chunks = [] for chunk_list in exec_res_list: all_chunks.extend(chunk_list) shared["all_chunks"] = all_chunks class EmbedDocs(BatchNode): def prep(self, shared): return shared["all_chunks"] def exec(self, chunk): return get_embedding(chunk) def post(self, shared, prep_res, exec_res_list): # Store the list of embeddings. shared["all_embeds"] = exec_res_list print(f"Total embeddings: {len(exec_res_list)}") class StoreIndex(Node): def prep(self, shared): # We'll read all embeds from shared. return shared["all_embeds"] def exec(self, all_embeds): # Create a vector index (faiss or other DB in real usage). 
index = create_index(all_embeds) return index def post(self, shared, prep_res, index): shared["index"] = index # Wire them in sequence chunk_node = ChunkDocs() embed_node = EmbedDocs() store_node = StoreIndex() chunk_node >> embed_node >> store_node OfflineFlow = Flow(start=chunk_node) ``` Usage example: ```python shared = { "files": ["doc1.txt", "doc2.txt"], # any text files } OfflineFlow.run(shared) ``` --- ## Stage 2: Online Query & Answer We have 3 nodes: 1. `EmbedQuery` – embeds the user’s question. 2. `RetrieveDocs` – retrieves top chunk from the index. 3. `GenerateAnswer` – calls the LLM with the question + chunk to produce the final answer. ```python class EmbedQuery(Node): def prep(self, shared): return shared["question"] def exec(self, question): return get_embedding(question) def post(self, shared, prep_res, q_emb): shared["q_emb"] = q_emb class RetrieveDocs(Node): def prep(self, shared): # We'll need the query embedding, plus the offline index/chunks return shared["q_emb"], shared["index"], shared["all_chunks"] def exec(self, inputs): q_emb, index, chunks = inputs I, D = search_index(index, q_emb, top_k=1) best_id = I[0][0] relevant_chunk = chunks[best_id] return relevant_chunk def post(self, shared, prep_res, relevant_chunk): shared["retrieved_chunk"] = relevant_chunk print("Retrieved chunk:", relevant_chunk[:60], "...") class GenerateAnswer(Node): def prep(self, shared): return shared["question"], shared["retrieved_chunk"] def exec(self, inputs): question, chunk = inputs prompt = f"Question: {question}\nContext: {chunk}\nAnswer:" return call_llm(prompt) def post(self, shared, prep_res, answer): shared["answer"] = answer print("Answer:", answer) embed_qnode = EmbedQuery() retrieve_node = RetrieveDocs() generate_node = GenerateAnswer() embed_qnode >> retrieve_node >> generate_node OnlineFlow = Flow(start=embed_qnode) ``` Usage example: ```python # Suppose we already ran OfflineFlow and have: # shared["all_chunks"], shared["index"], etc. 
shared["question"] = "Why do people like cats?"

OnlineFlow.run(shared)
# final answer in shared["answer"]
```

================================================
File: docs/design_pattern/structure.md
================================================
---
layout: default
title: "Structured Output"
parent: "Design Pattern"
nav_order: 5
---

# Structured Output

In many use cases, you may want the LLM to output a specific structure, such as a list or a dictionary with predefined keys.

There are several approaches to achieve a structured output:
- **Prompting** the LLM to strictly return a defined structure.
- Using LLMs that natively support **schema enforcement**.
- **Post-processing** the LLM's response to extract structured content.

In practice, **Prompting** is simple and reliable for modern LLMs.

### Example Use Cases

- Extracting Key Information

```yaml
product:
  name: Widget Pro
  price: 199.99
  description: |
    A high-quality widget designed for professionals.
    Recommended for advanced users.
```

- Summarizing Documents into Bullet Points

```yaml
summary:
  - This product is easy to use.
  - It is cost-effective.
  - Suitable for all skill levels.
```

- Generating Configuration Files

```yaml
server:
  host: 127.0.0.1
  port: 8080
  ssl: true
```

## Prompt Engineering

When prompting the LLM to produce **structured** output:
1. **Wrap** the structure in code fences (e.g., `yaml`).
2. **Validate** that all required fields exist (and let `Node` handle retries).

### Example Text Summarization

```python
class SummarizeNode(Node):
    def exec(self, prep_res):
        # Suppose `prep_res` is the text to summarize.
        prompt = f"""
Please summarize the following text as YAML, with exactly 3 bullet points

{prep_res}

Now, output:
```yaml
summary:
  - bullet 1
  - bullet 2
  - bullet 3
```"""
        response = call_llm(prompt)
        yaml_str = response.split("```yaml")[1].split("```")[0].strip()

        import yaml
        structured_result = yaml.safe_load(yaml_str)

        # A failed assert raises, which lets the Node retry
        assert "summary" in structured_result
        assert isinstance(structured_result["summary"], list)

        return structured_result
```

> Besides using `assert` statements, another popular way to validate schemas is [Pydantic](https://github.com/pydantic/pydantic).
{: .note }

### Why YAML instead of JSON?

Current LLMs struggle with escaping. YAML is easier with strings since they don't always need quotes.

**In JSON**

```json
{
  "dialogue": "Alice said: \"Hello Bob.\nHow are you?\nI am good.\""
}
```

- Every double quote inside the string must be escaped with `\"`.
- Each newline in the dialogue must be represented as `\n`.

**In YAML**

```yaml
dialogue: |
  Alice said: "Hello Bob.
  How are you?
  I am good."
```

- No need to escape interior quotes: just place the entire text under a block literal (`|`).
- Newlines are naturally preserved without needing `\n`.

================================================
File: docs/design_pattern/workflow.md
================================================
---
layout: default
title: "Workflow"
parent: "Design Pattern"
nav_order: 2
---

# Workflow

Many real-world tasks are too complex for one LLM call. The solution is **Task Decomposition**: decompose them into a [chain](../core_abstraction/flow.md) of multiple Nodes.
> - You don't want to make each task **too coarse**, because it may be *too complex for one LLM call*. > - You don't want to make each task **too granular**, because then *the LLM call doesn't have enough context* and results are *not consistent across nodes*. > > You usually need multiple *iterations* to find the *sweet spot*. If the task has too many *edge cases*, consider using [Agents](./agent.md). {: .best-practice } ### Example: Article Writing ```python class GenerateOutline(Node): def prep(self, shared): return shared["topic"] def exec(self, topic): return call_llm(f"Create a detailed outline for an article about {topic}") def post(self, shared, prep_res, exec_res): shared["outline"] = exec_res class WriteSection(Node): def prep(self, shared): return shared["outline"] def exec(self, outline): return call_llm(f"Write content based on this outline: {outline}") def post(self, shared, prep_res, exec_res): shared["draft"] = exec_res class ReviewAndRefine(Node): def prep(self, shared): return shared["draft"] def exec(self, draft): return call_llm(f"Review and improve this draft: {draft}") def post(self, shared, prep_res, exec_res): shared["final_article"] = exec_res # Connect nodes outline = GenerateOutline() write = WriteSection() review = ReviewAndRefine() outline >> write >> review # Create and run flow writing_flow = Flow(start=outline) shared = {"topic": "AI Safety"} writing_flow.run(shared) ``` For *dynamic cases*, consider using [Agents](./agent.md). ================================================ File: docs/utility_function/llm.md ================================================ --- layout: default title: "LLM Wrapper" parent: "Utility Function" nav_order: 1 --- # LLM Wrappers Check out libraries like [litellm](https://github.com/BerriAI/litellm). Here, we provide some minimal example implementations: 1. 
OpenAI
    ```python
    def call_llm(prompt):
        from openai import OpenAI
        client = OpenAI(api_key="YOUR_API_KEY_HERE")
        r = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return r.choices[0].message.content

    # Example usage
    call_llm("How are you?")
    ```
    > Store the API key in an environment variable like OPENAI_API_KEY for security.
    {: .best-practice }

2. Claude (Anthropic)
    ```python
    def call_llm(prompt):
        from anthropic import Anthropic
        client = Anthropic(api_key="YOUR_API_KEY_HERE")
        response = client.messages.create(
            model="claude-2",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        # `content` is a list of content blocks; return the text of the first one
        return response.content[0].text
    ```

3. Google (Generative AI Studio / PaLM API)
    ```python
    def call_llm(prompt):
        import google.generativeai as genai
        genai.configure(api_key="YOUR_API_KEY_HERE")
        response = genai.generate_text(
            model="models/text-bison-001",
            prompt=prompt
        )
        return response.result
    ```

4. Azure (Azure OpenAI)
    ```python
    def call_llm(prompt):
        from openai import AzureOpenAI
        client = AzureOpenAI(
            azure_endpoint="https://.openai.azure.com/",
            api_key="YOUR_API_KEY_HERE",
            api_version="2023-05-15"
        )
        r = client.chat.completions.create(
            model="",
            messages=[{"role": "user", "content": prompt}]
        )
        return r.choices[0].message.content
    ```

5. Ollama (Local LLM)
    ```python
    def call_llm(prompt):
        from ollama import chat
        response = chat(
            model="llama2",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.message.content
    ```

## Improvements

Feel free to enhance your `call_llm` function as needed.
Here are examples:

- Handle chat history:

```python
def call_llm(messages):
    from openai import OpenAI
    client = OpenAI(api_key="YOUR_API_KEY_HERE")
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return r.choices[0].message.content
```

- Add in-memory caching:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def call_llm(prompt):
    # Your implementation here
    pass
```

> ⚠️ Caching conflicts with Node retries, because a retry would return the same cached result.
>
> To address this, you could use cached results only on the first attempt.
{: .warning }

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_call(prompt):
    pass

def call_llm(prompt, use_cache):
    if use_cache:
        return cached_call(prompt)
    # Call the underlying function directly, bypassing the cache
    return cached_call.__wrapped__(prompt)

class SummarizeNode(Node):
    def exec(self, text):
        # Use the cache only on the first attempt (cur_retry is 0-based)
        return call_llm(f"Summarize: {text}", self.cur_retry == 0)
```

- Enable logging:

```python
def call_llm(prompt):
    import logging
    logging.info(f"Prompt: {prompt}")
    response = ... # Your implementation here
    logging.info(f"Response: {response}")
    return response
```

================================================
FILE: .cursorrules
================================================
---
layout: default
title: "Agentic Coding"
---

# Agentic Coding: Humans Design, Agents code!

> If you are an AI agent involved in building LLM Systems, read this guide **VERY, VERY** carefully! This is the most important chapter in the entire document. Throughout development, you should always (1) start with a small and simple solution, (2) design at a high level (`docs/design.md`) before implementation, and (3) frequently ask humans for feedback and clarification.
{: .warning } ## Agentic Coding Steps Agentic Coding should be a collaboration between Human System Design and Agent Implementation: | Steps | Human | AI | Comment | |:-----------------------|:----------:|:---------:|:------------------------------------------------------------------------| | 1. Requirements | ★★★ High | ★☆☆ Low | Humans understand the requirements and context. | | 2. Flow | ★★☆ Medium | ★★☆ Medium | Humans specify the high-level design, and the AI fills in the details. | | 3. Utilities | ★★☆ Medium | ★★☆ Medium | Humans provide available external APIs and integrations, and the AI helps with implementation. | | 4. Node | ★☆☆ Low | ★★★ High | The AI helps design the node types and data handling based on the flow. | | 5. Implementation | ★☆☆ Low | ★★★ High | The AI implements the flow based on the design. | | 6. Optimization | ★★☆ Medium | ★★☆ Medium | Humans evaluate the results, and the AI helps optimize. | | 7. Reliability | ★☆☆ Low | ★★★ High | The AI writes test cases and addresses corner cases. | 1. **Requirements**: Clarify the requirements for your project, and evaluate whether an AI system is a good fit. - Understand AI systems' strengths and limitations: - **Good for**: Routine tasks requiring common sense (filling forms, replying to emails) - **Good for**: Creative tasks with well-defined inputs (building slides, writing SQL) - **Not good for**: Ambiguous problems requiring complex decision-making (business strategy, startup planning) - **Keep It User-Centric:** Explain the "problem" from the user's perspective rather than just listing features. - **Balance complexity vs. impact**: Aim to deliver the highest value features with minimal complexity early. 2. **Flow Design**: Outline at a high level, describe how your AI system orchestrates nodes. - Identify applicable design patterns (e.g., [Map Reduce](./design_pattern/mapreduce.md), [Agent](./design_pattern/agent.md), [RAG](./design_pattern/rag.md)). 
- For each node in the flow, start with a high-level one-line description of what it does. - If using **Map Reduce**, specify how to map (what to split) and how to reduce (how to combine). - If using **Agent**, specify what are the inputs (context) and what are the possible actions. - If using **RAG**, specify what to embed, noting that there's usually both offline (indexing) and online (retrieval) workflows. - Outline the flow and draw it in a mermaid diagram. For example: ```mermaid flowchart LR start[Start] --> batch[Batch] batch --> check[Check] check -->|OK| process check -->|Error| fix[Fix] fix --> check subgraph process[Process] step1[Step 1] --> step2[Step 2] end process --> endNode[End] ``` - > **If Humans can't specify the flow, AI Agents can't automate it!** Before building an LLM system, thoroughly understand the problem and potential solution by manually solving example inputs to develop intuition. {: .best-practice } 3. **Utilities**: Based on the Flow Design, identify and implement necessary utility functions. - Think of your AI system as the brain. It needs a body—these *external utility functions*—to interact with the real world:
        - Reading inputs (e.g., retrieving Slack messages, reading emails)
        - Writing outputs (e.g., generating reports, sending emails)
        - Using external tools (e.g., calling LLMs, searching the web)
        - **NOTE**: *LLM-based tasks* (e.g., summarizing text, analyzing sentiment) are **NOT** utility functions; rather, they are *core functions* internal to the AI system.
    - For each utility function, implement it and write a simple test.
    - Document their input/output, as well as why they are necessary. For example:
        - `name`: `get_embedding` (`utils/get_embedding.py`)
        - `input`: `str`
        - `output`: a vector of 3072 floats
        - `necessity`: Used by the second node to embed text
    - Example utility implementation:
      ```python
      # utils/call_llm.py
      from openai import OpenAI

      def call_llm(prompt):
          client = OpenAI(api_key="YOUR_API_KEY_HERE")
          r = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": prompt}]
          )
          return r.choices[0].message.content

      if __name__ == "__main__":
          prompt = "What is the meaning of life?"
          print(call_llm(prompt))
      ```
    - > **Sometimes, design Utilities before Flow:** For example, for an LLM project to automate a legacy system, the bottleneck will likely be the available interface to that system. Start by designing the hardest utilities for interfacing, and then build the flow around them.
      {: .best-practice }

4. **Node Design**: Plan how each node will read and write data, and use utility functions.
    - One core design principle for PocketFlow is to use a [shared store](./core_abstraction/communication.md), so start with a shared store design:
        - For simple systems, use an in-memory dictionary.
        - For more complex systems or when persistence is required, use a database.
        - **Don't Repeat Yourself**: Use in-memory references or foreign keys.
    - Example shared store design:
      ```python
      shared = {
          "user": {
              "id": "user123",
              "context": {                # Another nested dict
                  "weather": {"temp": 72, "condition": "sunny"},
                  "location": "San Francisco"
              }
          },
          "results": {}                   # Empty dict to store outputs
      }
      ```
    - For each [Node](./core_abstraction/node.md), describe its type, how it reads and writes data, and which utility function it uses. Keep it specific but high-level, without code. For example:
        - `type`: Regular (or Batch, or Async)
        - `prep`: Read "text" from the shared store
        - `exec`: Call the embedding utility function
        - `post`: Write "embedding" to the shared store

5. **Implementation**: Implement the initial nodes and flows based on the design.
    - 🎉 If you've reached this step, humans have finished the design. Now *Agentic Coding* begins!
    - **"Keep it simple, stupid!"** Avoid complex features and full-scale type checking.
    - **FAIL FAST**! Avoid `try` logic so you can quickly identify any weak points in the system.
    - Add logging throughout the code to facilitate debugging.

6. **Optimization**:
    - **Use Intuition**: For a quick initial evaluation, human intuition is often a good start.
    - **Redesign Flow (Back to Step 3)**: Consider breaking down tasks further, introducing agentic decisions, or better managing input contexts.
    - If your flow design is already solid, move on to micro-optimizations:
        - **Prompt Engineering**: Use clear, specific instructions with examples to reduce ambiguity.
        - **In-Context Learning**: Provide robust examples for tasks that are difficult to specify with instructions alone.
    - > **You'll likely iterate a lot!** Expect to repeat Steps 3–6 hundreds of times.
      {: .best-practice }

7. **Reliability**
    - **Node Retries**: Add checks in the node `exec` to ensure outputs meet requirements, and consider increasing `max_retries` and `wait` times.
    - **Logging and Visualization**: Maintain logs of all attempts and visualize node results for easier debugging.
    - **Self-Evaluation**: Add a separate node (powered by an LLM) to review outputs when results are uncertain.

## Example LLM Project File Structure

```
my_project/
├── main.py
├── nodes.py
├── flow.py
├── utils/
│   ├── __init__.py
│   ├── call_llm.py
│   └── search_web.py
├── requirements.txt
└── docs/
    └── design.md
```

- **`docs/design.md`**: Contains project documentation for each step above. This should be *high-level* and *no-code*.
- **`utils/`**: Contains all utility functions.
  - It's recommended to dedicate one Python file to each API call, for example `call_llm.py` or `search_web.py`.
  - Each file should also include a `main()` function to try that API call.
- **`nodes.py`**: Contains all the node definitions.
  ```python
  # nodes.py
  from pocketflow import Node
  from utils.call_llm import call_llm

  class GetQuestionNode(Node):
      def exec(self, _):
          # Get question directly from user input
          user_question = input("Enter your question: ")
          return user_question

      def post(self, shared, prep_res, exec_res):
          # Store the user's question
          shared["question"] = exec_res
          return "default"  # Go to the next node

  class AnswerNode(Node):
      def prep(self, shared):
          # Read question from shared
          return shared["question"]

      def exec(self, question):
          # Call LLM to get the answer
          return call_llm(question)

      def post(self, shared, prep_res, exec_res):
          # Store the answer in shared
          shared["answer"] = exec_res
  ```
- **`flow.py`**: Implements functions that create flows by importing node definitions and connecting them.
  ```python
  # flow.py
  from pocketflow import Flow
  from nodes import GetQuestionNode, AnswerNode

  def create_qa_flow():
      """Create and return a question-answering flow."""
      # Create nodes
      get_question_node = GetQuestionNode()
      answer_node = AnswerNode()

      # Connect nodes in sequence
      get_question_node >> answer_node

      # Create flow starting with input node
      return Flow(start=get_question_node)
  ```
- **`main.py`**: Serves as the project's entry point.
  ```python
  # main.py
  from flow import create_qa_flow

  # Example main function
  # Please replace this with your own main function
  def main():
      shared = {
          "question": None,  # Will be populated by GetQuestionNode from user input
          "answer": None     # Will be populated by AnswerNode
      }

      # Create the flow and run it
      qa_flow = create_qa_flow()
      qa_flow.run(shared)
      print(f"Question: {shared['question']}")
      print(f"Answer: {shared['answer']}")

  if __name__ == "__main__":
      main()
  ```

================================================
File: docs/index.md
================================================
---
layout: default
title: "Home"
nav_order: 1
---

# Pocket Flow

A [100-line](https://github.com/the-pocket/PocketFlow/blob/main/pocketflow/__init__.py) minimalist LLM framework for *Agents, Task Decomposition, RAG, etc*.

- **Lightweight**: Just the core graph abstraction in 100 lines. ZERO dependencies and ZERO vendor lock-in.
- **Expressive**: Everything you love from larger frameworks: ([Multi-](./design_pattern/multi_agent.html))[Agents](./design_pattern/agent.html), [Workflow](./design_pattern/workflow.html), [RAG](./design_pattern/rag.html), and more.
- **Agentic-Coding**: Intuitive enough for AI agents to help humans build complex LLM applications.
## Core Abstraction We model the LLM workflow as a **Graph + Shared Store**: - [Node](./core_abstraction/node.md) handles simple (LLM) tasks. - [Flow](./core_abstraction/flow.md) connects nodes through **Actions** (labeled edges). - [Shared Store](./core_abstraction/communication.md) enables communication between nodes within flows. - [Batch](./core_abstraction/batch.md) nodes/flows allow for data-intensive tasks. - [Async](./core_abstraction/async.md) nodes/flows allow waiting for asynchronous tasks. - [(Advanced) Parallel](./core_abstraction/parallel.md) nodes/flows handle I/O-bound tasks.
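
To make the **Graph + Shared Store** idea concrete, here is a minimal plain-Python sketch. It is not Pocket Flow's source code: `ToyNode`, `run_graph`, and the `successors` dict are hypothetical names invented for illustration, and the real library wires nodes with `>>` as described on the Flow page.

```python
# Toy model of "Graph + Shared Store" (illustrative only, not Pocket Flow's code).

class ToyNode:
    """A node with prep -> exec -> post steps and action-labeled successors."""

    def __init__(self):
        self.successors = {}  # action string -> next node

    def prep(self, shared):
        return None

    def exec(self, prep_res):
        return None

    def post(self, shared, prep_res, exec_res):
        return "default"


def run_graph(start, shared):
    """Walk the graph from `start`, following the action each node returns."""
    node = start
    while node is not None:
        prep_res = node.prep(shared)
        exec_res = node.exec(prep_res)
        # A missing return value is treated as the "default" action
        action = node.post(shared, prep_res, exec_res) or "default"
        node = node.successors.get(action)


class Load(ToyNode):
    def post(self, shared, prep_res, exec_res):
        shared["text"] = "hello world"  # write to the shared store


class Upper(ToyNode):
    def prep(self, shared):
        return shared["text"]           # read from the shared store

    def exec(self, text):
        return text.upper()             # pure compute, no access to shared

    def post(self, shared, prep_res, exec_res):
        shared["result"] = exec_res


load, upper = Load(), Upper()
load.successors["default"] = upper      # labeled edge: "default" -> upper

shared = {}
run_graph(load, shared)
print(shared["result"])
```

The nodes never call each other directly: they only read from and write to `shared`, and the edges decide what runs next.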
## Design Pattern From there, it’s easy to implement popular design patterns: - [Agent](./design_pattern/agent.md) autonomously makes decisions. - [Workflow](./design_pattern/workflow.md) chains multiple tasks into pipelines. - [RAG](./design_pattern/rag.md) integrates data retrieval with generation. - [Map Reduce](./design_pattern/mapreduce.md) splits data tasks into Map and Reduce steps. - [Structured Output](./design_pattern/structure.md) formats outputs consistently. - [(Advanced) Multi-Agents](./design_pattern/multi_agent.md) coordinate multiple agents.
## Utility Function We **do not** provide built-in utilities. Instead, we offer *examples*—please *implement your own*: - [LLM Wrapper](./utility_function/llm.md) - [Viz and Debug](./utility_function/viz.md) - [Web Search](./utility_function/websearch.md) - [Chunking](./utility_function/chunking.md) - [Embedding](./utility_function/embedding.md) - [Vector Databases](./utility_function/vector.md) - [Text-to-Speech](./utility_function/text_to_speech.md) **Why not built-in?**: I believe it's a *bad practice* for vendor-specific APIs in a general framework: - *API Volatility*: Frequent changes lead to heavy maintenance for hardcoded APIs. - *Flexibility*: You may want to switch vendors, use fine-tuned models, or run them locally. - *Optimizations*: Prompt caching, batching, and streaming are easier without vendor lock-in. ## Ready to build your Apps? Check out [Agentic Coding Guidance](./guide.md), the fastest way to develop LLM projects with Pocket Flow! ================================================ File: docs/core_abstraction/async.md ================================================ --- layout: default title: "(Advanced) Async" parent: "Core Abstraction" nav_order: 5 --- # (Advanced) Async **Async** Nodes implement `prep_async()`, `exec_async()`, `exec_fallback_async()`, and/or `post_async()`. This is useful for: 1. **prep_async()**: For *fetching/reading data (files, APIs, DB)* in an I/O-friendly way. 2. **exec_async()**: Typically used for async LLM calls. 3. **post_async()**: For *awaiting user feedback*, *coordinating across multi-agents* or any additional async steps after `exec_async()`. **Note**: `AsyncNode` must be wrapped in `AsyncFlow`. `AsyncFlow` can also include regular (sync) nodes. 
### Example

```python
import asyncio
from pocketflow import AsyncNode, AsyncFlow

# read_file_async, call_llm_async, gather_user_feedback and Finalize
# are placeholders for your own utilities and nodes.

class SummarizeThenVerify(AsyncNode):
    async def prep_async(self, shared):
        # Example: read a file asynchronously
        doc_text = await read_file_async(shared["doc_path"])
        return doc_text

    async def exec_async(self, prep_res):
        # Example: async LLM call
        summary = await call_llm_async(f"Summarize: {prep_res}")
        return summary

    async def post_async(self, shared, prep_res, exec_res):
        # Example: wait for user feedback
        decision = await gather_user_feedback(exec_res)
        if decision == "approve":
            shared["summary"] = exec_res
            return "approve"
        return "deny"

summarize_node = SummarizeThenVerify()
final_node = Finalize()

# Define transitions
summarize_node - "approve" >> final_node
summarize_node - "deny" >> summarize_node  # retry

flow = AsyncFlow(start=summarize_node)

async def main():
    shared = {"doc_path": "document.txt"}
    await flow.run_async(shared)
    print("Final Summary:", shared.get("summary"))

asyncio.run(main())
```

================================================
File: docs/core_abstraction/batch.md
================================================
---
layout: default
title: "Batch"
parent: "Core Abstraction"
nav_order: 4
---

# Batch

**Batch** makes it easier to handle large inputs in one Node or **rerun** a Flow multiple times. Example use cases:

- **Chunk-based** processing (e.g., splitting large texts).
- **Iterative** processing over lists of input items (e.g., user queries, files, URLs).

## 1. BatchNode

A **BatchNode** extends `Node` but changes `prep()` and `exec()`:

- **`prep(shared)`**: returns an **iterable** (e.g., list, generator).
- **`exec(item)`**: called **once** per item in that iterable.
- **`post(shared, prep_res, exec_res_list)`**: after all items are processed, receives a **list** of results (`exec_res_list`) and returns an **Action**.
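
The contract above can be pictured as a simple loop. The sketch below is a plain-Python stand-in that assumes nothing about Pocket Flow's internals; `run_batch_node` and `WordCounts` are hypothetical names used only for illustration.

```python
# Plain-Python picture of the BatchNode contract (not the library's implementation).

def run_batch_node(node, shared):
    items = node.prep(shared)                          # prep returns an iterable
    exec_res_list = [node.exec(item) for item in items]  # exec runs once per item
    return node.post(shared, items, exec_res_list)     # post receives the full list


class WordCounts:
    def prep(self, shared):
        return shared["lines"]

    def exec(self, line):
        # Handles exactly one item from the iterable
        return len(line.split())

    def post(self, shared, prep_res, exec_res_list):
        shared["total_words"] = sum(exec_res_list)
        return "default"


shared = {"lines": ["a b c", "d e", "f"]}
action = run_batch_node(WordCounts(), shared)
print(action, shared["total_words"])
```

Only `post` sees all results together, which is why aggregation belongs there rather than in `exec`.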
### Example: Summarize a Large File

```python
class MapSummaries(BatchNode):
    def prep(self, shared):
        # Suppose we have a big file; chunk it
        content = shared["data"]
        chunk_size = 10000
        chunks = [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)]
        return chunks

    def exec(self, chunk):
        prompt = f"Summarize this chunk in 10 words: {chunk}"
        summary = call_llm(prompt)
        return summary

    def post(self, shared, prep_res, exec_res_list):
        combined = "\n".join(exec_res_list)
        shared["summary"] = combined
        return "default"

map_summaries = MapSummaries()
flow = Flow(start=map_summaries)
flow.run(shared)
```

---

## 2. BatchFlow

A **BatchFlow** runs a **Flow** multiple times, each time with different `params`. Think of it as a loop that replays the Flow for each parameter set.

### Example: Summarize Many Files

```python
class SummarizeAllFiles(BatchFlow):
    def prep(self, shared):
        # Return a list of param dicts (one per file)
        filenames = list(shared["data"].keys())  # e.g., ["file1.txt", "file2.txt", ...]
        return [{"filename": fn} for fn in filenames]

# Suppose we have a per-file Flow (e.g., load_file >> summarize >> reduce):
summarize_file = Flow(start=load_file)

# Wrap that flow into a BatchFlow:
summarize_all_files = SummarizeAllFiles(start=summarize_file)
summarize_all_files.run(shared)
```

### Under the Hood

1. `prep(shared)` returns a list of param dicts, e.g., `[{"filename": "file1.txt"}, {"filename": "file2.txt"}, ...]`.
2. The **BatchFlow** loops through each dict. For each one:
   - It merges the dict with the BatchFlow’s own `params`.
   - It calls `flow.run(shared)` using the merged result.
3. This means the sub-Flow is run **repeatedly**, once for every param dict.

---

## 3. Nested or Multi-Level Batches

You can nest a **BatchFlow** in another **BatchFlow**. For instance:

- **Outer** batch: returns a list of directory param dicts (e.g., `{"directory": "/pathA"}`, `{"directory": "/pathB"}`, ...).
- **Inner** batch: returns a list of per-file param dicts.
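
The merge described under "Under the Hood" behaves like a dictionary union applied at each nesting level. Here is a rough sketch in plain Python, where `merge_params` is a hypothetical helper (not a Pocket Flow function) and the rule that the inner dict wins on key conflicts is an assumption of this sketch:

```python
def merge_params(parent_params, batch_params):
    """Return a new dict; on conflict the inner batch's keys win (assumed here)."""
    return {**parent_params, **batch_params}


outer = [{"directory": "/pathA"}, {"directory": "/pathB"}]      # outer batch
inner = [{"filename": "file1.txt"}, {"filename": "file2.txt"}]  # inner batch

seen = []
for dir_params in outer:
    level1 = merge_params({}, dir_params)           # outer level: directory only
    for file_params in inner:
        level2 = merge_params(level1, file_params)  # inner level: directory + file
        seen.append(level2)

print(seen[0])
```

Each innermost run therefore receives one dict carrying the whole chain of identifiers.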
At each level, **BatchFlow** merges its own param dict with the parent’s. By the time you reach the **innermost** node, the final `params` is the merged result of **all** parents in the chain. This way, a nested structure can keep track of the entire context (e.g., directory + file name) at once.

```python
import os

class FileBatchFlow(BatchFlow):
    def prep(self, shared):
        # The directory comes from the parent BatchFlow's params
        directory = self.params["directory"]
        # e.g., files = ["file1.txt", "file2.txt", ...]
        files = [f for f in os.listdir(directory) if f.endswith(".txt")]
        return [{"filename": f} for f in files]

class DirectoryBatchFlow(BatchFlow):
    def prep(self, shared):
        directories = ["/path/to/dirA", "/path/to/dirB"]
        return [{"directory": d} for d in directories]

# MapSummaries receives params like {"directory": "/path/to/dirA", "filename": "file1.txt"}
inner_flow = FileBatchFlow(start=MapSummaries())
outer_flow = DirectoryBatchFlow(start=inner_flow)
```

================================================
File: docs/core_abstraction/communication.md
================================================
---
layout: default
title: "Communication"
parent: "Core Abstraction"
nav_order: 3
---

# Communication

Nodes and Flows **communicate** in 2 ways:

1. **Shared Store (for almost all cases)**

   - A global data structure (often an in-memory dict) that all nodes can read (`prep()`) and write (`post()`).
   - Great for data results, large content, or anything multiple nodes need.
   - Design the data structure and populate it ahead of time.

   - > **Separation of Concerns:** Use `Shared Store` for almost all cases to separate *Data Schema* from *Compute Logic*! This approach is both flexible and easy to manage, resulting in more maintainable code. `Params` is mostly syntactic sugar for [Batch](./batch.md).
     {: .best-practice }

2. **Params (only for [Batch](./batch.md))**
   - Each node has a local, ephemeral `params` dict passed in by the **parent Flow**, used as an identifier for tasks. Parameter keys and values shall be **immutable**.
- Good for identifiers like filenames or numeric IDs, in Batch mode. If you know memory management, think of the **Shared Store** like a **heap** (shared by all function calls), and **Params** like a **stack** (assigned by the caller). --- ## 1. Shared Store ### Overview A shared store is typically an in-mem dictionary, like: ```python shared = {"data": {}, "summary": {}, "config": {...}, ...} ``` It can also contain local file handlers, DB connections, or a combination for persistence. We recommend deciding the data structure or DB schema first based on your app requirements. ### Example ```python class LoadData(Node): def post(self, shared, prep_res, exec_res): # We write data to shared store shared["data"] = "Some text content" return None class Summarize(Node): def prep(self, shared): # We read data from shared store return shared["data"] def exec(self, prep_res): # Call LLM to summarize prompt = f"Summarize: {prep_res}" summary = call_llm(prompt) return summary def post(self, shared, prep_res, exec_res): # We write summary to shared store shared["summary"] = exec_res return "default" load_data = LoadData() summarize = Summarize() load_data >> summarize flow = Flow(start=load_data) shared = {} flow.run(shared) ``` Here: - `LoadData` writes to `shared["data"]`. - `Summarize` reads from `shared["data"]`, summarizes, and writes to `shared["summary"]`. --- ## 2. Params **Params** let you store *per-Node* or *per-Flow* config that doesn't need to live in the shared store. They are: - **Immutable** during a Node's run cycle (i.e., they don't change mid-`prep->exec->post`). - **Set** via `set_params()`. - **Cleared** and updated each time a parent Flow calls it. > Only set the uppermost Flow params because others will be overwritten by the parent Flow. > > If you need to set child node params, see [Batch](./batch.md). {: .warning } Typically, **Params** are identifiers (e.g., file name, page number). 
Use them to fetch the task you assigned or write to a specific part of the shared store. ### Example ```python # 1) Create a Node that uses params class SummarizeFile(Node): def prep(self, shared): # Access the node's param filename = self.params["filename"] return shared["data"].get(filename, "") def exec(self, prep_res): prompt = f"Summarize: {prep_res}" return call_llm(prompt) def post(self, shared, prep_res, exec_res): filename = self.params["filename"] shared["summary"][filename] = exec_res return "default" # 2) Set params node = SummarizeFile() # 3) Set Node params directly (for testing) node.set_params({"filename": "doc1.txt"}) node.run(shared) # 4) Create Flow flow = Flow(start=node) # 5) Set Flow params (overwrites node params) flow.set_params({"filename": "doc2.txt"}) flow.run(shared) # The node summarizes doc2, not doc1 ``` ================================================ File: docs/core_abstraction/flow.md ================================================ --- layout: default title: "Flow" parent: "Core Abstraction" nav_order: 2 --- # Flow A **Flow** orchestrates a graph of Nodes. You can chain Nodes in a sequence or create branching depending on the **Actions** returned from each Node's `post()`. ## 1. Action-based Transitions Each Node's `post()` returns an **Action** string. By default, if `post()` doesn't return anything, we treat that as `"default"`. You define transitions with the syntax: 1. **Basic default transition**: `node_a >> node_b` This means if `node_a.post()` returns `"default"`, go to `node_b`. (Equivalent to `node_a - "default" >> node_b`) 2. **Named action transition**: `node_a - "action_name" >> node_b` This means if `node_a.post()` returns `"action_name"`, go to `node_b`. It's possible to create loops, branching, or multi-step flows. ## 2. Creating a Flow A **Flow** begins with a **start** node. You call `Flow(start=some_node)` to specify the entry point. 
When you call `flow.run(shared)`, it executes the start node, looks at its returned Action from `post()`, follows the transition, and continues until there's no next node. ### Example: Simple Sequence Here's a minimal flow of two nodes in a chain: ```python node_a >> node_b flow = Flow(start=node_a) flow.run(shared) ``` - When you run the flow, it executes `node_a`. - Suppose `node_a.post()` returns `"default"`. - The flow then sees `"default"` Action is linked to `node_b` and runs `node_b`. - `node_b.post()` returns `"default"` but we didn't define `node_b >> something_else`. So the flow ends there. ### Example: Branching & Looping Here's a simple expense approval flow that demonstrates branching and looping. The `ReviewExpense` node can return three possible Actions: - `"approved"`: expense is approved, move to payment processing - `"needs_revision"`: expense needs changes, send back for revision - `"rejected"`: expense is denied, finish the process We can wire them like this: ```python # Define the flow connections review - "approved" >> payment # If approved, process payment review - "needs_revision" >> revise # If needs changes, go to revision review - "rejected" >> finish # If rejected, finish the process revise >> review # After revision, go back for another review payment >> finish # After payment, finish the process flow = Flow(start=review) ``` Let's see how it flows: 1. If `review.post()` returns `"approved"`, the expense moves to the `payment` node 2. If `review.post()` returns `"needs_revision"`, it goes to the `revise` node, which then loops back to `review` 3. If `review.post()` returns `"rejected"`, it moves to the `finish` node and stops ```mermaid flowchart TD review[Review Expense] -->|approved| payment[Process Payment] review -->|needs_revision| revise[Revise Report] review -->|rejected| finish[Finish Process] revise --> review payment --> finish ``` ### Running Individual Nodes vs. 
Running a Flow - `node.run(shared)`: Just runs that node alone (calls `prep->exec->post()`), returns an Action. - `flow.run(shared)`: Executes from the start node, follows Actions to the next node, and so on until the flow can't continue. > `node.run(shared)` **does not** proceed to the successor. > This is mainly for debugging or testing a single node. > > Always use `flow.run(...)` in production to ensure the full pipeline runs correctly. {: .warning } ## 3. Nested Flows A **Flow** can act like a Node, which enables powerful composition patterns. This means you can: 1. Use a Flow as a Node within another Flow's transitions. 2. Combine multiple smaller Flows into a larger Flow for reuse. 3. Node `params` will be a merging of **all** parents' `params`. ### Flow's Node Methods A **Flow** is also a **Node**, so it will run `prep()` and `post()`. However: - It **won't** run `exec()`, as its main logic is to orchestrate its nodes. - `post()` always receives `None` for `exec_res` and should instead get the flow execution results from the shared store. ### Basic Flow Nesting Here's how to connect a flow to another node: ```python # Create a sub-flow node_a >> node_b subflow = Flow(start=node_a) # Connect it to another node subflow >> node_c # Create the parent flow parent_flow = Flow(start=subflow) ``` When `parent_flow.run()` executes: 1. It starts `subflow` 2. `subflow` runs through its nodes (`node_a->node_b`) 3. 
After `subflow` completes, execution continues to `node_c` ### Example: Order Processing Pipeline Here's a practical example that breaks down order processing into nested flows: ```python # Payment processing sub-flow validate_payment >> process_payment >> payment_confirmation payment_flow = Flow(start=validate_payment) # Inventory sub-flow check_stock >> reserve_items >> update_inventory inventory_flow = Flow(start=check_stock) # Shipping sub-flow create_label >> assign_carrier >> schedule_pickup shipping_flow = Flow(start=create_label) # Connect the flows into a main order pipeline payment_flow >> inventory_flow >> shipping_flow # Create the master flow order_pipeline = Flow(start=payment_flow) # Run the entire pipeline order_pipeline.run(shared_data) ``` This creates a clean separation of concerns while maintaining a clear execution path: ```mermaid flowchart LR subgraph order_pipeline[Order Pipeline] subgraph paymentFlow["Payment Flow"] A[Validate Payment] --> B[Process Payment] --> C[Payment Confirmation] end subgraph inventoryFlow["Inventory Flow"] D[Check Stock] --> E[Reserve Items] --> F[Update Inventory] end subgraph shippingFlow["Shipping Flow"] G[Create Label] --> H[Assign Carrier] --> I[Schedule Pickup] end paymentFlow --> inventoryFlow inventoryFlow --> shippingFlow end ``` ================================================ File: docs/core_abstraction/node.md ================================================ --- layout: default title: "Node" parent: "Core Abstraction" nav_order: 1 --- # Node A **Node** is the smallest building block. Each Node has 3 steps `prep->exec->post`:
1. `prep(shared)`
   - **Read and preprocess data** from the `shared` store.
   - Examples: *query DB, read files, or serialize data into a string*.
   - Return `prep_res`, which is used by `exec()` and `post()`.

2. `exec(prep_res)`
   - **Execute compute logic**, with optional retries and error handling (below).
   - Examples: *(mostly) LLM calls, remote APIs, tool use*.
   - ⚠️ This step is for compute only and must **NOT** access `shared`.
   - ⚠️ If retries are enabled, make sure the implementation is idempotent.
   - Return `exec_res`, which is passed to `post()`.

3. `post(shared, prep_res, exec_res)`
   - **Postprocess and write data** back to `shared`.
   - Examples: *update DB, change states, log results*.
   - **Decide the next action** by returning a *string* (`action = "default"` if *None*).

> **Why 3 steps?** To enforce the principle of *separation of concerns*: data storage and data processing are handled separately.
>
> All steps are *optional*. E.g., you can implement only `prep` and `post` if you just need to process data.
{: .note }

### Fault Tolerance & Retries

You can **retry** `exec()` if it raises an exception, using two parameters when defining the Node:

- `max_retries` (int): Max times to run `exec()`. The default is `1` (**no** retry).
- `wait` (int): The time to wait (in **seconds**) before the next retry. By default, `wait=0` (no waiting).

`wait` is helpful when you encounter rate-limits or quota errors from your LLM provider and need to back off.

```python
my_node = SummarizeFile(max_retries=3, wait=10)
```

When an exception occurs in `exec()`, the Node automatically retries until:

- It either succeeds, or
- The Node has retried `max_retries - 1` times already and fails on the last attempt.

You can get the current retry count (0-based) from `self.cur_retry`.
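
The retry rules above can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not the library's implementation: `run_with_retries` and `flaky` are hypothetical names, and the 0-based retry counter is passed to the function as an argument here rather than read from `self.cur_retry`.

```python
import time

def run_with_retries(exec_fn, prep_res, max_retries=1, wait=0):
    """Call exec_fn at most max_retries times, sleeping `wait` seconds between tries."""
    for cur_retry in range(max_retries):      # 0-based attempt counter
        try:
            return exec_fn(prep_res, cur_retry)
        except Exception:
            if cur_retry == max_retries - 1:  # last attempt failed: give up
                raise
            if wait > 0:
                time.sleep(wait)              # back off before the next attempt


attempts = []

def flaky(prep_res, cur_retry):
    """Fails on the first two attempts, then succeeds."""
    attempts.append(cur_retry)
    if cur_retry < 2:
        raise RuntimeError("transient failure")
    return f"ok: {prep_res}"


result = run_with_retries(flaky, "data", max_retries=3, wait=0)
print(result, attempts)
```

With `max_retries=3` there are three attempts in total, so the default of `1` means a single attempt and no retry.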
```python class RetryNode(Node): def exec(self, prep_res): print(f"Retry {self.cur_retry} times") raise Exception("Failed") ``` ### Graceful Fallback To **gracefully handle** the exception (after all retries) rather than raising it, override: ```python def exec_fallback(self, prep_res, exc): raise exc ``` By default, it just re-raises exception. But you can return a fallback result instead, which becomes the `exec_res` passed to `post()`. ### Example: Summarize file ```python class SummarizeFile(Node): def prep(self, shared): return shared["data"] def exec(self, prep_res): if not prep_res: return "Empty file content" prompt = f"Summarize this text in 10 words: {prep_res}" summary = call_llm(prompt) # might fail return summary def exec_fallback(self, prep_res, exc): # Provide a simple fallback instead of crashing return "There was an error processing your request." def post(self, shared, prep_res, exec_res): shared["summary"] = exec_res # Return "default" by not returning summarize_node = SummarizeFile(max_retries=3) # node.run() calls prep->exec->post # If exec() fails, it retries up to 3 times before calling exec_fallback() action_result = summarize_node.run(shared) print("Action returned:", action_result) # "default" print("Summary stored:", shared["summary"]) ``` ================================================ File: docs/core_abstraction/parallel.md ================================================ --- layout: default title: "(Advanced) Parallel" parent: "Core Abstraction" nav_order: 6 --- # (Advanced) Parallel **Parallel** Nodes and Flows let you run multiple **Async** Nodes and Flows **concurrently**—for example, summarizing multiple texts at once. This can improve performance by overlapping I/O and compute. > Because of Python’s GIL, parallel nodes and flows can’t truly parallelize CPU-bound tasks (e.g., heavy numerical computations). However, they excel at overlapping I/O-bound work—like LLM calls, database queries, API requests, or file I/O. 
{: .warning } > - **Ensure Tasks Are Independent**: If each item depends on the output of a previous item, **do not** parallelize. > > - **Beware of Rate Limits**: Parallel calls can **quickly** trigger rate limits on LLM services. You may need a **throttling** mechanism (e.g., semaphores or sleep intervals). > > - **Consider Single-Node Batch APIs**: Some LLMs offer a **batch inference** API where you can send multiple prompts in a single call. This is more complex to implement but can be more efficient than launching many parallel requests and mitigates rate limits. {: .best-practice } ## AsyncParallelBatchNode Like **AsyncBatchNode**, but run `exec_async()` in **parallel**: ```python class ParallelSummaries(AsyncParallelBatchNode): async def prep_async(self, shared): # e.g., multiple texts return shared["texts"] async def exec_async(self, text): prompt = f"Summarize: {text}" return await call_llm_async(prompt) async def post_async(self, shared, prep_res, exec_res_list): shared["summary"] = "\n\n".join(exec_res_list) return "default" node = ParallelSummaries() flow = AsyncFlow(start=node) ``` ## AsyncParallelBatchFlow Parallel version of **BatchFlow**. Each iteration of the sub-flow runs **concurrently** using different parameters: ```python class SummarizeMultipleFiles(AsyncParallelBatchFlow): async def prep_async(self, shared): return [{"filename": f} for f in shared["files"]] sub_flow = AsyncFlow(start=LoadAndSummarizeFile()) parallel_flow = SummarizeMultipleFiles(start=sub_flow) await parallel_flow.run_async(shared) ``` ================================================ File: docs/design_pattern/agent.md ================================================ --- layout: default title: "Agent" parent: "Design Pattern" nav_order: 1 --- # Agent Agent is a powerful design pattern in which nodes can take dynamic actions based on the context.
## Implement Agent with Graph

1. **Context and Action:** Implement nodes that supply context and perform actions.
2. **Branching:** Use branching to connect each action node to an agent node. Use actions to let the agent direct the [flow](../core_abstraction/flow.md) between nodes and potentially loop back for multi-step tasks.
3. **Agent Node:** Provide a prompt to decide the action, for example:

```python
f"""
### CONTEXT
Task: {task_description}
Previous Actions: {previous_actions}
Current State: {current_state}

### ACTION SPACE
[1] search
  Description: Use web search to get results
  Parameters:
    - query (str): What to search for

[2] answer
  Description: Conclude based on the results
  Parameters:
    - result (str): Final answer to provide

### NEXT ACTION
Decide the next action based on the current context and available action space.
Return your response in the following format:

```yaml
thinking: |
    <your step-by-step reasoning process>
action: <action_name>
parameters:
    <parameter_name>: <parameter_value>
```"""
```

The core of building **high-performance** and **reliable** agents boils down to:

1. **Context Management:** Provide *relevant, minimal context.* For example, rather than including an entire chat history, retrieve the most relevant parts via [RAG](./rag.md). Even with larger context windows, LLMs still fall victim to ["lost in the middle"](https://arxiv.org/abs/2307.03172), overlooking mid-prompt content.

2. **Action Space:** Provide *a well-structured and unambiguous* set of actions, avoiding overlap like separate `read_databases` or `read_csvs`. Instead, import CSVs into the database.

## Example Good Action Design

- **Incremental:** Feed content in manageable chunks (500 lines or 1 page) instead of all at once.

- **Overview-zoom-in:** First provide high-level structure (table of contents, summary), then allow drilling into details (raw texts).

- **Parameterized/Programmable:** Instead of fixed actions, enable parameterized (columns to select) or programmable (SQL queries) actions, for example, to read CSV files.
- **Backtracking:** Let the agent undo the last step instead of restarting entirely, preserving progress when encountering errors or dead ends.

## Example: Search Agent

This agent:
1. Decides whether to search or answer
2. If it searches, loops back to decide whether more searching is needed
3. Answers once enough context has been gathered

```python
import yaml
from pocketflow import Node, Flow
# call_llm and search_web are utility functions you implement yourself.

class DecideAction(Node):
    def prep(self, shared):
        context = shared.get("context", "No previous search")
        query = shared["query"]
        return query, context

    def exec(self, inputs):
        query, context = inputs
        prompt = f"""
Given input: {query}
Previous search results: {context}
Should I: 1) Search web for more info 2) Answer with current knowledge
Output in yaml:
```yaml
action: search/answer
reason: why this action
search_term: search phrase if action is search
```"""
        resp = call_llm(prompt)
        # Extract and parse the YAML block from the LLM response
        yaml_str = resp.split("```yaml")[1].split("```")[0].strip()
        result = yaml.safe_load(yaml_str)

        # Validate the structure; a failed assert triggers a node retry if enabled
        assert isinstance(result, dict)
        assert "action" in result
        assert "reason" in result
        assert result["action"] in ["search", "answer"]
        if result["action"] == "search":
            assert "search_term" in result

        return result

    def post(self, shared, prep_res, exec_res):
        if exec_res["action"] == "search":
            shared["search_term"] = exec_res["search_term"]
        return exec_res["action"]

class SearchWeb(Node):
    def prep(self, shared):
        return shared["search_term"]

    def exec(self, search_term):
        return search_web(search_term)

    def post(self, shared, prep_res, exec_res):
        # Append this search to the accumulated context
        prev_searches = shared.get("context", [])
        shared["context"] = prev_searches + [
            {"term": shared["search_term"], "result": exec_res}
        ]
        return "decide"

class DirectAnswer(Node):
    def prep(self, shared):
        return shared["query"], shared.get("context", "")

    def exec(self, inputs):
        query, context = inputs
        return call_llm(f"Context: {context}\nAnswer: {query}")

    def post(self, shared, prep_res, exec_res):
        print(f"Answer: {exec_res}")
        shared["answer"] = exec_res

# Connect nodes
decide = DecideAction()
search = SearchWeb()
answer = DirectAnswer()

decide - "search" >> search
decide - "answer" >> answer
search - "decide" >> decide  # Loop back

flow = Flow(start=decide)
flow.run({"query": "Who won the Nobel Prize in Physics 2024?"})
```

================================================
File: docs/design_pattern/mapreduce.md
================================================
---
layout: default
title: "Map Reduce"
parent: "Design Pattern"
nav_order: 4
---

# Map Reduce

MapReduce is a design pattern suitable when you have either:
- Large input data (e.g., multiple files to process), or
- Large output data (e.g., multiple forms to fill)

and there is a logical way to break the task into smaller, ideally independent parts.
You first break down the task using [BatchNode](../core_abstraction/batch.md) in the map phase, followed by aggregation in the reduce phase. ### Example: Document Summarization ```python class SummarizeAllFiles(BatchNode): def prep(self, shared): files_dict = shared["files"] # e.g. 10 files return list(files_dict.items()) # [("file1.txt", "aaa..."), ("file2.txt", "bbb..."), ...] def exec(self, one_file): filename, file_content = one_file summary_text = call_llm(f"Summarize the following file:\n{file_content}") return (filename, summary_text) def post(self, shared, prep_res, exec_res_list): shared["file_summaries"] = dict(exec_res_list) class CombineSummaries(Node): def prep(self, shared): return shared["file_summaries"] def exec(self, file_summaries): # format as: "File1: summary\nFile2: summary...\n" text_list = [] for fname, summ in file_summaries.items(): text_list.append(f"{fname} summary:\n{summ}\n") big_text = "\n---\n".join(text_list) return call_llm(f"Combine these file summaries into one final summary:\n{big_text}") def post(self, shared, prep_res, final_summary): shared["all_files_summary"] = final_summary batch_node = SummarizeAllFiles() combine_node = CombineSummaries() batch_node >> combine_node flow = Flow(start=batch_node) shared = { "files": { "file1.txt": "Alice was beginning to get very tired of sitting by her sister...", "file2.txt": "Some other interesting text ...", # ... } } flow.run(shared) print("Individual Summaries:", shared["file_summaries"]) print("\nFinal Summary:\n", shared["all_files_summary"]) ``` ================================================ File: docs/design_pattern/rag.md ================================================ --- layout: default title: "RAG" parent: "Design Pattern" nav_order: 3 --- # RAG (Retrieval Augmented Generation) For certain LLM tasks like answering questions, providing relevant context is essential. One common architecture is a **two-stage** RAG pipeline:
1. **Offline stage**: Preprocess and index documents ("building the index"). 2. **Online stage**: Given a question, generate answers by retrieving the most relevant context. --- ## Stage 1: Offline Indexing We create three Nodes: 1. `ChunkDocs` – [chunks](../utility_function/chunking.md) raw text. 2. `EmbedDocs` – [embeds](../utility_function/embedding.md) each chunk. 3. `StoreIndex` – stores embeddings into a [vector database](../utility_function/vector.md). ```python class ChunkDocs(BatchNode): def prep(self, shared): # A list of file paths in shared["files"]. We process each file. return shared["files"] def exec(self, filepath): # read file content. In real usage, do error handling. with open(filepath, "r", encoding="utf-8") as f: text = f.read() # chunk by 100 chars each chunks = [] size = 100 for i in range(0, len(text), size): chunks.append(text[i : i + size]) return chunks def post(self, shared, prep_res, exec_res_list): # exec_res_list is a list of chunk-lists, one per file. # flatten them all into a single list of chunks. all_chunks = [] for chunk_list in exec_res_list: all_chunks.extend(chunk_list) shared["all_chunks"] = all_chunks class EmbedDocs(BatchNode): def prep(self, shared): return shared["all_chunks"] def exec(self, chunk): return get_embedding(chunk) def post(self, shared, prep_res, exec_res_list): # Store the list of embeddings. shared["all_embeds"] = exec_res_list print(f"Total embeddings: {len(exec_res_list)}") class StoreIndex(Node): def prep(self, shared): # We'll read all embeds from shared. return shared["all_embeds"] def exec(self, all_embeds): # Create a vector index (faiss or other DB in real usage). 
index = create_index(all_embeds) return index def post(self, shared, prep_res, index): shared["index"] = index # Wire them in sequence chunk_node = ChunkDocs() embed_node = EmbedDocs() store_node = StoreIndex() chunk_node >> embed_node >> store_node OfflineFlow = Flow(start=chunk_node) ``` Usage example: ```python shared = { "files": ["doc1.txt", "doc2.txt"], # any text files } OfflineFlow.run(shared) ``` --- ## Stage 2: Online Query & Answer We have 3 nodes: 1. `EmbedQuery` – embeds the user’s question. 2. `RetrieveDocs` – retrieves top chunk from the index. 3. `GenerateAnswer` – calls the LLM with the question + chunk to produce the final answer. ```python class EmbedQuery(Node): def prep(self, shared): return shared["question"] def exec(self, question): return get_embedding(question) def post(self, shared, prep_res, q_emb): shared["q_emb"] = q_emb class RetrieveDocs(Node): def prep(self, shared): # We'll need the query embedding, plus the offline index/chunks return shared["q_emb"], shared["index"], shared["all_chunks"] def exec(self, inputs): q_emb, index, chunks = inputs I, D = search_index(index, q_emb, top_k=1) best_id = I[0][0] relevant_chunk = chunks[best_id] return relevant_chunk def post(self, shared, prep_res, relevant_chunk): shared["retrieved_chunk"] = relevant_chunk print("Retrieved chunk:", relevant_chunk[:60], "...") class GenerateAnswer(Node): def prep(self, shared): return shared["question"], shared["retrieved_chunk"] def exec(self, inputs): question, chunk = inputs prompt = f"Question: {question}\nContext: {chunk}\nAnswer:" return call_llm(prompt) def post(self, shared, prep_res, answer): shared["answer"] = answer print("Answer:", answer) embed_qnode = EmbedQuery() retrieve_node = RetrieveDocs() generate_node = GenerateAnswer() embed_qnode >> retrieve_node >> generate_node OnlineFlow = Flow(start=embed_qnode) ``` Usage example: ```python # Suppose we already ran OfflineFlow and have: # shared["all_chunks"], shared["index"], etc. 
shared["question"] = "Why do people like cats?" OnlineFlow.run(shared) # final answer in shared["answer"] ``` ================================================ File: docs/design_pattern/structure.md ================================================ --- layout: default title: "Structured Output" parent: "Design Pattern" nav_order: 5 --- # Structured Output In many use cases, you may want the LLM to output a specific structure, such as a list or a dictionary with predefined keys. There are several approaches to achieving structured output: - **Prompting** the LLM to strictly return a defined structure. - Using LLMs that natively support **schema enforcement**. - **Post-processing** the LLM's response to extract structured content. In practice, **Prompting** is simple and reliable for modern LLMs. ### Example Use Cases - Extracting Key Information ```yaml product: name: Widget Pro price: 199.99 description: | A high-quality widget designed for professionals. Recommended for advanced users. ``` - Summarizing Documents into Bullet Points ```yaml summary: - This product is easy to use. - It is cost-effective. - Suitable for all skill levels. ``` - Generating Configuration Files ```yaml server: host: 127.0.0.1 port: 8080 ssl: true ``` ## Prompt Engineering When prompting the LLM to produce **structured** output: 1. **Wrap** the structure in code fences (e.g., `yaml`). 2. **Validate** that all required fields exist (and let `Node` handle retries). ### Example Text Summarization ```python class SummarizeNode(Node): def exec(self, prep_res): # Suppose `prep_res` is the text to summarize.
prompt = f""" Please summarize the following text as YAML, with exactly 3 bullet points {prep_res} Now, output: ```yaml summary: - bullet 1 - bullet 2 - bullet 3 ```""" response = call_llm(prompt) yaml_str = response.split("```yaml")[1].split("```")[0].strip() import yaml structured_result = yaml.safe_load(yaml_str) assert "summary" in structured_result assert isinstance(structured_result["summary"], list) return structured_result ``` > Besides using `assert` statements, another popular way to validate schemas is [Pydantic](https://github.com/pydantic/pydantic). {: .note } ### Why YAML instead of JSON? Current LLMs struggle with escaping. YAML is easier with strings since they don't always need quotes. **In JSON** ```json { "dialogue": "Alice said: \"Hello Bob.\nHow are you?\nI am good.\"" } ``` - Every double quote inside the string must be escaped with `\"`. - Each newline in the dialogue must be represented as `\n`. **In YAML** ```yaml dialogue: | Alice said: "Hello Bob. How are you? I am good." ``` - No need to escape interior quotes: just place the entire text under a block literal (`|`). - Newlines are naturally preserved without needing `\n`. ================================================ File: docs/design_pattern/workflow.md ================================================ --- layout: default title: "Workflow" parent: "Design Pattern" nav_order: 2 --- # Workflow Many real-world tasks are too complex for one LLM call. The solution is **Task Decomposition**: decompose them into a [chain](../core_abstraction/flow.md) of multiple Nodes.
> - You don't want to make each task **too coarse**, because it may be *too complex for one LLM call*. > - You don't want to make each task **too granular**, because then *the LLM call doesn't have enough context* and results are *not consistent across nodes*. > > You usually need multiple *iterations* to find the *sweet spot*. If the task has too many *edge cases*, consider using [Agents](./agent.md). {: .best-practice } ### Example: Article Writing ```python class GenerateOutline(Node): def prep(self, shared): return shared["topic"] def exec(self, topic): return call_llm(f"Create a detailed outline for an article about {topic}") def post(self, shared, prep_res, exec_res): shared["outline"] = exec_res class WriteSection(Node): def prep(self, shared): return shared["outline"] def exec(self, outline): return call_llm(f"Write content based on this outline: {outline}") def post(self, shared, prep_res, exec_res): shared["draft"] = exec_res class ReviewAndRefine(Node): def prep(self, shared): return shared["draft"] def exec(self, draft): return call_llm(f"Review and improve this draft: {draft}") def post(self, shared, prep_res, exec_res): shared["final_article"] = exec_res # Connect nodes outline = GenerateOutline() write = WriteSection() review = ReviewAndRefine() outline >> write >> review # Create and run flow writing_flow = Flow(start=outline) shared = {"topic": "AI Safety"} writing_flow.run(shared) ``` For *dynamic cases*, consider using [Agents](./agent.md). ================================================ File: docs/utility_function/llm.md ================================================ --- layout: default title: "LLM Wrapper" parent: "Utility Function" nav_order: 1 --- # LLM Wrappers Check out libraries like [litellm](https://github.com/BerriAI/litellm). Here, we provide some minimal example implementations: 1. 
OpenAI ```python def call_llm(prompt): from openai import OpenAI client = OpenAI(api_key="YOUR_API_KEY_HERE") r = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": prompt}] ) return r.choices[0].message.content # Example usage call_llm("How are you?") ``` > Store the API key in an environment variable like OPENAI_API_KEY for security. {: .best-practice } 2. Claude (Anthropic) ```python def call_llm(prompt): from anthropic import Anthropic client = Anthropic(api_key="YOUR_API_KEY_HERE") response = client.messages.create( model="claude-2", messages=[{"role": "user", "content": prompt}], max_tokens=100 ) # `content` is a list of content blocks; return the text of the first one return response.content[0].text ``` 3. Google (Generative AI Studio / PaLM API) ```python def call_llm(prompt): import google.generativeai as genai genai.configure(api_key="YOUR_API_KEY_HERE") response = genai.generate_text( model="models/text-bison-001", prompt=prompt ) return response.result ``` 4. Azure (Azure OpenAI) ```python def call_llm(prompt): from openai import AzureOpenAI client = AzureOpenAI( azure_endpoint="https://.openai.azure.com/", api_key="YOUR_API_KEY_HERE", api_version="2023-05-15" ) r = client.chat.completions.create( model="", messages=[{"role": "user", "content": prompt}] ) return r.choices[0].message.content ``` 5. Ollama (Local LLM) ```python def call_llm(prompt): from ollama import chat response = chat( model="llama2", messages=[{"role": "user", "content": prompt}] ) return response.message.content ``` ## Improvements Feel free to enhance your `call_llm` function as needed.
Here are examples: - Handle chat history: ```python def call_llm(messages): from openai import OpenAI client = OpenAI(api_key="YOUR_API_KEY_HERE") r = client.chat.completions.create( model="gpt-4o", messages=messages ) return r.choices[0].message.content ``` - Add in-memory caching ```python from functools import lru_cache @lru_cache(maxsize=1000) def call_llm(prompt): # Your implementation here pass ``` > ⚠️ Caching conflicts with Node retries, as retries yield the same result. > > To address this, you could use cached results only if not retried. {: .warning } ```python from functools import lru_cache @lru_cache(maxsize=1000) def cached_call(prompt): pass def call_llm(prompt, use_cache): if use_cache: return cached_call(prompt) # Call the underlying function directly return cached_call.__wrapped__(prompt) class SummarizeNode(Node): def exec(self, text): return call_llm(f"Summarize: {text}", self.cur_retry==0) ``` - Enable logging: ```python def call_llm(prompt): import logging logging.info(f"Prompt: {prompt}") response = ... # Your implementation here logging.info(f"Response: {response}") return response ``` ================================================ FILE: .dockerignore ================================================ # Byte-compiled / cache files __pycache__/ *.py[cod] *.pyo *.pyd # Virtual environments venv/ env/ .venv/ .env/ # Distribution / packaging *.egg-info/ build/ dist/ # Git and other VCS .git/ .gitignore # Editor files *.swp *.swo *.bak *.tmp .DS_Store .idea/ .vscode/ # Secrets (if you’re using .env for API keys etc.) 
.env ================================================ FILE: .gitignore ================================================ # Dependencies node_modules/ vendor/ .pnp/ .pnp.js # Build outputs dist/ build/ out/ *.pyc __pycache__/ # Environment files .env .env.local .env.*.local .env.development .env.test .env.production # Python virtual environments .venv/ venv/ # IDE - VSCode .vscode/* !.vscode/settings.json !.vscode/tasks.json !.vscode/launch.json !.vscode/extensions.json # IDE - JetBrains .idea/ *.iml *.iws *.ipr # IDE - Eclipse .project .classpath .settings/ # Logs logs/ *.log npm-debug.log* yarn-debug.log* yarn-error.log* # Operating System .DS_Store Thumbs.db *.swp *.swo # Testing coverage/ .nyc_output/ # Temporary files *.tmp *.temp .cache/ # Compiled files *.com *.class *.dll *.exe *.o *.so # Package files *.7z *.dmg *.gz *.iso *.jar *.rar *.tar *.zip # Database *.sqlite *.sqlite3 *.db # Optional npm cache directory .npm # Optional eslint cache .eslintcache # Optional REPL history .node_repl_history # LLM cache llm_cache.json # Output files output/ # uv manage pyproject.toml uv.lock docs/*.pdf docs/design-cn.md ================================================ FILE: .windsurfrules ================================================ --- layout: default title: "Agentic Coding" --- # Agentic Coding: Humans Design, Agents code! > If you are an AI agents involved in building LLM Systems, read this guide **VERY, VERY** carefully! This is the most important chapter in the entire document. Throughout development, you should always (1) start with a small and simple solution, (2) design at a high level (`docs/design.md`) before implementation, and (3) frequently ask humans for feedback and clarification. 
{: .warning } ## Agentic Coding Steps Agentic Coding should be a collaboration between Human System Design and Agent Implementation: | Steps | Human | AI | Comment | |:-----------------------|:----------:|:---------:|:------------------------------------------------------------------------| | 1. Requirements | ★★★ High | ★☆☆ Low | Humans understand the requirements and context. | | 2. Flow | ★★☆ Medium | ★★☆ Medium | Humans specify the high-level design, and the AI fills in the details. | | 3. Utilities | ★★☆ Medium | ★★☆ Medium | Humans provide available external APIs and integrations, and the AI helps with implementation. | | 4. Node | ★☆☆ Low | ★★★ High | The AI helps design the node types and data handling based on the flow. | | 5. Implementation | ★☆☆ Low | ★★★ High | The AI implements the flow based on the design. | | 6. Optimization | ★★☆ Medium | ★★☆ Medium | Humans evaluate the results, and the AI helps optimize. | | 7. Reliability | ★☆☆ Low | ★★★ High | The AI writes test cases and addresses corner cases. | 1. **Requirements**: Clarify the requirements for your project, and evaluate whether an AI system is a good fit. - Understand AI systems' strengths and limitations: - **Good for**: Routine tasks requiring common sense (filling forms, replying to emails) - **Good for**: Creative tasks with well-defined inputs (building slides, writing SQL) - **Not good for**: Ambiguous problems requiring complex decision-making (business strategy, startup planning) - **Keep It User-Centric:** Explain the "problem" from the user's perspective rather than just listing features. - **Balance complexity vs. impact**: Aim to deliver the highest value features with minimal complexity early. 2. **Flow Design**: Outline at a high level, describe how your AI system orchestrates nodes. - Identify applicable design patterns (e.g., [Map Reduce](./design_pattern/mapreduce.md), [Agent](./design_pattern/agent.md), [RAG](./design_pattern/rag.md)). 
- For each node in the flow, start with a high-level one-line description of what it does. - If using **Map Reduce**, specify how to map (what to split) and how to reduce (how to combine). - If using **Agent**, specify what are the inputs (context) and what are the possible actions. - If using **RAG**, specify what to embed, noting that there's usually both offline (indexing) and online (retrieval) workflows. - Outline the flow and draw it in a mermaid diagram. For example: ```mermaid flowchart LR start[Start] --> batch[Batch] batch --> check[Check] check -->|OK| process check -->|Error| fix[Fix] fix --> check subgraph process[Process] step1[Step 1] --> step2[Step 2] end process --> endNode[End] ``` - > **If Humans can't specify the flow, AI Agents can't automate it!** Before building an LLM system, thoroughly understand the problem and potential solution by manually solving example inputs to develop intuition. {: .best-practice } 3. **Utilities**: Based on the Flow Design, identify and implement necessary utility functions. - Think of your AI system as the brain. It needs a body—these *external utility functions*—to interact with the real world:
- Reading inputs (e.g., retrieving Slack messages, reading emails) - Writing outputs (e.g., generating reports, sending emails) - Using external tools (e.g., calling LLMs, searching the web) - **NOTE**: *LLM-based tasks* (e.g., summarizing text, analyzing sentiment) are **NOT** utility functions; rather, they are *core functions* internal to the AI system. - For each utility function, implement it and write a simple test. - Document their input/output, as well as why they are necessary. For example: - `name`: `get_embedding` (`utils/get_embedding.py`) - `input`: `str` - `output`: a vector of 3072 floats - `necessity`: Used by the second node to embed text - Example utility implementation: ```python # utils/call_llm.py from openai import OpenAI def call_llm(prompt): client = OpenAI(api_key="YOUR_API_KEY_HERE") r = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": prompt}] ) return r.choices[0].message.content if __name__ == "__main__": prompt = "What is the meaning of life?" print(call_llm(prompt)) ``` - > **Sometimes, design Utilities before Flow:** For example, for an LLM project to automate a legacy system, the bottleneck will likely be the available interface to that system. Start by designing the hardest utilities for interfacing, and then build the flow around them. {: .best-practice } 4. **Node Design**: Plan how each node will read and write data, and use utility functions. - One core design principle for PocketFlow is to use a [shared store](./core_abstraction/communication.md), so start with a shared store design: - For simple systems, use an in-memory dictionary. - For more complex systems or when persistence is required, use a database. - **Don't Repeat Yourself**: Use in-memory references or foreign keys.
- Example shared store design: ```python shared = { "user": { "id": "user123", "context": { # Another nested dict "weather": {"temp": 72, "condition": "sunny"}, "location": "San Francisco" } }, "results": {} # Empty dict to store outputs } ``` - For each [Node](./core_abstraction/node.md), describe its type, how it reads and writes data, and which utility function it uses. Keep it specific but high-level, without code. For example: - `type`: Regular (or Batch, or Async) - `prep`: Read "text" from the shared store - `exec`: Call the embedding utility function - `post`: Write "embedding" to the shared store 5. **Implementation**: Implement the initial nodes and flows based on the design. - 🎉 If you've reached this step, humans have finished the design. Now *Agentic Coding* begins! - **"Keep it simple, stupid!"** Avoid complex features and full-scale type checking. - **FAIL FAST**! Avoid `try` logic so you can quickly identify any weak points in the system. - Add logging throughout the code to facilitate debugging. 6. **Optimization**: - **Use Intuition**: For a quick initial evaluation, human intuition is often a good start. - **Redesign Flow (Back to Step 3)**: Consider breaking down tasks further, introducing agentic decisions, or better managing input contexts. - If your flow design is already solid, move on to micro-optimizations: - **Prompt Engineering**: Use clear, specific instructions with examples to reduce ambiguity. - **In-Context Learning**: Provide robust examples for tasks that are difficult to specify with instructions alone. - > **You'll likely iterate a lot!** Expect to repeat Steps 3–6 hundreds of times. {: .best-practice } 7. **Reliability** - **Node Retries**: Add checks in the node `exec` to ensure outputs meet requirements, and consider increasing `max_retries` and `wait` times. - **Logging and Visualization**: Maintain logs of all attempts and visualize node results for easier debugging. - **Self-Evaluation**: Add a separate node (powered by an LLM) to review outputs when results are uncertain. ## Example LLM Project File Structure ``` my_project/ ├── main.py ├── nodes.py ├── flow.py ├── utils/ │ ├── __init__.py │ ├── call_llm.py │ └── search_web.py ├── requirements.txt └── docs/ └── design.md ``` - **`docs/design.md`**: Contains project documentation for each step above. This should be *high-level* and *no-code*. - **`utils/`**: Contains all utility functions. - It's recommended to dedicate one Python file to each API call, for example `call_llm.py` or `search_web.py`. - Each file should also include a `main()` function to try that API call. - **`nodes.py`**: Contains all the node definitions. ```python # nodes.py from pocketflow import Node from utils.call_llm import call_llm class GetQuestionNode(Node): def exec(self, _): # Get question directly from user input user_question = input("Enter your question: ") return user_question def post(self, shared, prep_res, exec_res): # Store the user's question shared["question"] = exec_res return "default" # Go to the next node class AnswerNode(Node): def prep(self, shared): # Read question from shared return shared["question"] def exec(self, question): # Call LLM to get the answer return call_llm(question) def post(self, shared, prep_res, exec_res): # Store the answer in shared shared["answer"] = exec_res ``` - **`flow.py`**: Implements functions that create flows by importing node definitions and connecting them.
```python # flow.py from pocketflow import Flow from nodes import GetQuestionNode, AnswerNode def create_qa_flow(): """Create and return a question-answering flow.""" # Create nodes get_question_node = GetQuestionNode() answer_node = AnswerNode() # Connect nodes in sequence get_question_node >> answer_node # Create flow starting with input node return Flow(start=get_question_node) ``` - **`main.py`**: Serves as the project's entry point. ```python # main.py from flow import create_qa_flow # Example main function # Please replace this with your own main function def main(): shared = { "question": None, # Will be populated by GetQuestionNode from user input "answer": None # Will be populated by AnswerNode } # Create the flow and run it qa_flow = create_qa_flow() qa_flow.run(shared) print(f"Question: {shared['question']}") print(f"Answer: {shared['answer']}") if __name__ == "__main__": main() ``` ================================================ File: docs/index.md ================================================ --- layout: default title: "Home" nav_order: 1 --- # Pocket Flow A [100-line](https://github.com/the-pocket/PocketFlow/blob/main/pocketflow/__init__.py) minimalist LLM framework for *Agents, Task Decomposition, RAG, etc*. - **Lightweight**: Just the core graph abstraction in 100 lines. ZERO dependencies, ZERO vendor lock-in. - **Expressive**: Everything you love from larger frameworks: ([Multi-](./design_pattern/multi_agent.html))[Agents](./design_pattern/agent.html), [Workflow](./design_pattern/workflow.html), [RAG](./design_pattern/rag.html), and more. - **Agentic-Coding**: Intuitive enough for AI agents to help humans build complex LLM applications.
## Core Abstraction We model the LLM workflow as a **Graph + Shared Store**: - [Node](./core_abstraction/node.md) handles simple (LLM) tasks. - [Flow](./core_abstraction/flow.md) connects nodes through **Actions** (labeled edges). - [Shared Store](./core_abstraction/communication.md) enables communication between nodes within flows. - [Batch](./core_abstraction/batch.md) nodes/flows allow for data-intensive tasks. - [Async](./core_abstraction/async.md) nodes/flows allow waiting for asynchronous tasks. - [(Advanced) Parallel](./core_abstraction/parallel.md) nodes/flows handle I/O-bound tasks.
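To make the **Graph + Shared Store** idea concrete, here is a minimal, self-contained sketch. It is a toy stand-in rather than Pocket Flow's actual code: `ToyNode`, `ToyFlow`, and the `then()` helper are hypothetical names invented for this illustration. It only assumes the `prep -> exec -> post` cycle and the action-labeled edges described above.

```python
# Toy illustration only: ToyNode/ToyFlow are hypothetical, NOT Pocket Flow's classes.

class ToyNode:
    def __init__(self):
        self.successors = {}  # action label -> next node (the labeled edges)

    def then(self, node, action="default"):
        """Connect this node to `node` via the edge labeled `action`."""
        self.successors[action] = node
        return node

    def prep(self, shared):
        return None

    def exec(self, prep_res):
        return None

    def post(self, shared, prep_res, exec_res):
        return "default"

    def run(self, shared):
        prep_res = self.prep(shared)          # read from the shared store
        exec_res = self.exec(prep_res)        # compute
        action = self.post(shared, prep_res, exec_res)  # write back, pick an edge
        return action or "default"


class ToyFlow:
    def __init__(self, start):
        self.start = start

    def run(self, shared):
        node = self.start
        while node is not None:
            action = node.run(shared)
            node = node.successors.get(action)  # stop when no edge matches
        return shared


class Load(ToyNode):
    def post(self, shared, prep_res, exec_res):
        shared["text"] = "hello world"  # returns None, treated as "default"


class Count(ToyNode):
    def prep(self, shared):
        return shared["text"]

    def exec(self, text):
        return len(text.split())

    def post(self, shared, prep_res, exec_res):
        shared["word_count"] = exec_res
        return "done"  # no edge labeled "done", so the flow ends here


load, count = Load(), Count()
load.then(count)
shared = ToyFlow(start=load).run({})
print(shared)
```

The nodes never call each other directly: `Load` and `Count` communicate only through the `shared` dict, and the flow decides what runs next from the action string each node returns.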
## Design Pattern From there, it’s easy to implement popular design patterns: - [Agent](./design_pattern/agent.md) autonomously makes decisions. - [Workflow](./design_pattern/workflow.md) chains multiple tasks into pipelines. - [RAG](./design_pattern/rag.md) integrates data retrieval with generation. - [Map Reduce](./design_pattern/mapreduce.md) splits data tasks into Map and Reduce steps. - [Structured Output](./design_pattern/structure.md) formats outputs consistently. - [(Advanced) Multi-Agents](./design_pattern/multi_agent.md) coordinate multiple agents.
## Utility Function We **do not** provide built-in utilities. Instead, we offer *examples*; please *implement your own*: - [LLM Wrapper](./utility_function/llm.md) - [Viz and Debug](./utility_function/viz.md) - [Web Search](./utility_function/websearch.md) - [Chunking](./utility_function/chunking.md) - [Embedding](./utility_function/embedding.md) - [Vector Databases](./utility_function/vector.md) - [Text-to-Speech](./utility_function/text_to_speech.md) **Why not built-in?**: I believe it's *bad practice* to hardcode vendor-specific APIs in a general framework: - *API Volatility*: Frequent changes lead to heavy maintenance for hardcoded APIs. - *Flexibility*: You may want to switch vendors, use fine-tuned models, or run them locally. - *Optimizations*: Prompt caching, batching, and streaming are easier without vendor lock-in. ## Ready to build your Apps? Check out [Agentic Coding Guidance](./guide.md), the fastest way to develop LLM projects with Pocket Flow! ================================================ File: docs/core_abstraction/async.md ================================================ --- layout: default title: "(Advanced) Async" parent: "Core Abstraction" nav_order: 5 --- # (Advanced) Async **Async** Nodes implement `prep_async()`, `exec_async()`, `exec_fallback_async()`, and/or `post_async()`. This is useful for: 1. **prep_async()**: For *fetching/reading data (files, APIs, DB)* in an I/O-friendly way. 2. **exec_async()**: Typically used for async LLM calls. 3. **post_async()**: For *awaiting user feedback*, *coordinating across multiple agents*, or any additional async steps after `exec_async()`. **Note**: `AsyncNode` must be wrapped in `AsyncFlow`. `AsyncFlow` can also include regular (sync) nodes.
### Example ```python class SummarizeThenVerify(AsyncNode): async def prep_async(self, shared): # Example: read a file asynchronously doc_text = await read_file_async(shared["doc_path"]) return doc_text async def exec_async(self, prep_res): # Example: async LLM call summary = await call_llm_async(f"Summarize: {prep_res}") return summary async def post_async(self, shared, prep_res, exec_res): # Example: wait for user feedback decision = await gather_user_feedback(exec_res) if decision == "approve": shared["summary"] = exec_res return "approve" return "deny" summarize_node = SummarizeThenVerify() final_node = Finalize() # Define transitions summarize_node - "approve" >> final_node summarize_node - "deny" >> summarize_node # retry flow = AsyncFlow(start=summarize_node) async def main(): shared = {"doc_path": "document.txt"} await flow.run_async(shared) print("Final Summary:", shared.get("summary")) asyncio.run(main()) ``` ================================================ File: docs/core_abstraction/batch.md ================================================ --- layout: default title: "Batch" parent: "Core Abstraction" nav_order: 4 --- # Batch **Batch** makes it easier to handle large inputs in one Node or **rerun** a Flow multiple times. Example use cases: - **Chunk-based** processing (e.g., splitting large texts). - **Iterative** processing over lists of input items (e.g., user queries, files, URLs). ## 1. BatchNode A **BatchNode** extends `Node` but changes `prep()` and `exec()`: - **`prep(shared)`**: returns an **iterable** (e.g., list, generator). - **`exec(item)`**: called **once** per item in that iterable. - **`post(shared, prep_res, exec_res_list)`**: after all items are processed, receives a **list** of results (`exec_res_list`) and returns an **Action**. 
### Example: Summarize a Large File ```python class MapSummaries(BatchNode): def prep(self, shared): # Suppose we have a big file; chunk it content = shared["data"] chunk_size = 10000 chunks = [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)] return chunks def exec(self, chunk): prompt = f"Summarize this chunk in 10 words: {chunk}" summary = call_llm(prompt) return summary def post(self, shared, prep_res, exec_res_list): combined = "\n".join(exec_res_list) shared["summary"] = combined return "default" map_summaries = MapSummaries() flow = Flow(start=map_summaries) flow.run(shared) ``` --- ## 2. BatchFlow A **BatchFlow** runs a **Flow** multiple times, each time with different `params`. Think of it as a loop that replays the Flow for each parameter set. ### Example: Summarize Many Files ```python class SummarizeAllFiles(BatchFlow): def prep(self, shared): # Return a list of param dicts (one per file) filenames = list(shared["data"].keys()) # e.g., ["file1.txt", "file2.txt", ...] return [{"filename": fn} for fn in filenames] # Suppose we have a per-file Flow (e.g., load_file >> summarize >> reduce): summarize_file = SummarizeFile(start=load_file) # Wrap that flow into a BatchFlow: summarize_all_files = SummarizeAllFiles(start=summarize_file) summarize_all_files.run(shared) ``` ### Under the Hood 1. `prep(shared)` returns a list of param dicts, e.g., `[{"filename": "file1.txt"}, {"filename": "file2.txt"}, ...]`. 2. The **BatchFlow** loops through each dict. For each one: - It merges the dict with the BatchFlow’s own `params`. - It calls `flow.run(shared)` using the merged result. 3. This means the sub-Flow is run **repeatedly**, once for every param dict. --- ## 3. Nested or Multi-Level Batches You can nest a **BatchFlow** in another **BatchFlow**. For instance: - **Outer** batch: returns a list of directory param dicts (e.g., `{"directory": "/pathA"}`, `{"directory": "/pathB"}`, ...). - **Inner** batch: returns a list of per-file param dicts.
At each level, **BatchFlow** merges its own param dict with the parent’s. By the time you reach the **innermost** node, the final `params` is the merged result of **all** parents in the chain. This way, a nested structure can keep track of the entire context (e.g., directory + file name) at once.

```python
import os

class FileBatchFlow(BatchFlow):
    def prep(self, shared):
        directory = self.params["directory"]
        # e.g., files = ["file1.txt", "file2.txt", ...]
        files = [f for f in os.listdir(directory) if f.endswith(".txt")]
        return [{"filename": f} for f in files]

class DirectoryBatchFlow(BatchFlow):
    def prep(self, shared):
        directories = ["/path/to/dirA", "/path/to/dirB"]
        return [{"directory": d} for d in directories]

# MapSummaries has params like {"directory": "/path/to/dirA", "filename": "file1.txt"}
inner_flow = FileBatchFlow(start=MapSummaries())
outer_flow = DirectoryBatchFlow(start=inner_flow)
```

================================================
File: docs/core_abstraction/communication.md
================================================
---
layout: default
title: "Communication"
parent: "Core Abstraction"
nav_order: 3
---

# Communication

Nodes and Flows **communicate** in 2 ways:

1. **Shared Store (for almost all the cases)**

   - A global data structure (often an in-memory dict) that all nodes can read (`prep()`) and write (`post()`).
   - Great for data results, large content, or anything multiple nodes need.
   - Design the data structure and populate it ahead of time.

   - > **Separation of Concerns:** Use `Shared Store` for almost all cases to separate *Data Schema* from *Compute Logic*! This approach is both flexible and easy to manage, resulting in more maintainable code. `Params` is mostly syntactic sugar for [Batch](./batch.md).
     {: .best-practice }

2. **Params (only for [Batch](./batch.md))**
   - Each node has a local, ephemeral `params` dict passed in by the **parent Flow**, used as an identifier for tasks. Parameter keys and values shall be **immutable**.
- Good for identifiers like filenames or numeric IDs, in Batch mode. If you know memory management, think of the **Shared Store** like a **heap** (shared by all function calls), and **Params** like a **stack** (assigned by the caller). --- ## 1. Shared Store ### Overview A shared store is typically an in-mem dictionary, like: ```python shared = {"data": {}, "summary": {}, "config": {...}, ...} ``` It can also contain local file handlers, DB connections, or a combination for persistence. We recommend deciding the data structure or DB schema first based on your app requirements. ### Example ```python class LoadData(Node): def post(self, shared, prep_res, exec_res): # We write data to shared store shared["data"] = "Some text content" return None class Summarize(Node): def prep(self, shared): # We read data from shared store return shared["data"] def exec(self, prep_res): # Call LLM to summarize prompt = f"Summarize: {prep_res}" summary = call_llm(prompt) return summary def post(self, shared, prep_res, exec_res): # We write summary to shared store shared["summary"] = exec_res return "default" load_data = LoadData() summarize = Summarize() load_data >> summarize flow = Flow(start=load_data) shared = {} flow.run(shared) ``` Here: - `LoadData` writes to `shared["data"]`. - `Summarize` reads from `shared["data"]`, summarizes, and writes to `shared["summary"]`. --- ## 2. Params **Params** let you store *per-Node* or *per-Flow* config that doesn't need to live in the shared store. They are: - **Immutable** during a Node's run cycle (i.e., they don't change mid-`prep->exec->post`). - **Set** via `set_params()`. - **Cleared** and updated each time a parent Flow calls it. > Only set the uppermost Flow params because others will be overwritten by the parent Flow. > > If you need to set child node params, see [Batch](./batch.md). {: .warning } Typically, **Params** are identifiers (e.g., file name, page number). 
Use them to fetch the task you assigned or write to a specific part of the shared store. ### Example ```python # 1) Create a Node that uses params class SummarizeFile(Node): def prep(self, shared): # Access the node's param filename = self.params["filename"] return shared["data"].get(filename, "") def exec(self, prep_res): prompt = f"Summarize: {prep_res}" return call_llm(prompt) def post(self, shared, prep_res, exec_res): filename = self.params["filename"] shared["summary"][filename] = exec_res return "default" # 2) Set params node = SummarizeFile() # 3) Set Node params directly (for testing) node.set_params({"filename": "doc1.txt"}) node.run(shared) # 4) Create Flow flow = Flow(start=node) # 5) Set Flow params (overwrites node params) flow.set_params({"filename": "doc2.txt"}) flow.run(shared) # The node summarizes doc2, not doc1 ``` ================================================ File: docs/core_abstraction/flow.md ================================================ --- layout: default title: "Flow" parent: "Core Abstraction" nav_order: 2 --- # Flow A **Flow** orchestrates a graph of Nodes. You can chain Nodes in a sequence or create branching depending on the **Actions** returned from each Node's `post()`. ## 1. Action-based Transitions Each Node's `post()` returns an **Action** string. By default, if `post()` doesn't return anything, we treat that as `"default"`. You define transitions with the syntax: 1. **Basic default transition**: `node_a >> node_b` This means if `node_a.post()` returns `"default"`, go to `node_b`. (Equivalent to `node_a - "default" >> node_b`) 2. **Named action transition**: `node_a - "action_name" >> node_b` This means if `node_a.post()` returns `"action_name"`, go to `node_b`. It's possible to create loops, branching, or multi-step flows. ## 2. Creating a Flow A **Flow** begins with a **start** node. You call `Flow(start=some_node)` to specify the entry point. 
When you call `flow.run(shared)`, it executes the start node, looks at its returned Action from `post()`, follows the transition, and continues until there's no next node. ### Example: Simple Sequence Here's a minimal flow of two nodes in a chain: ```python node_a >> node_b flow = Flow(start=node_a) flow.run(shared) ``` - When you run the flow, it executes `node_a`. - Suppose `node_a.post()` returns `"default"`. - The flow then sees `"default"` Action is linked to `node_b` and runs `node_b`. - `node_b.post()` returns `"default"` but we didn't define `node_b >> something_else`. So the flow ends there. ### Example: Branching & Looping Here's a simple expense approval flow that demonstrates branching and looping. The `ReviewExpense` node can return three possible Actions: - `"approved"`: expense is approved, move to payment processing - `"needs_revision"`: expense needs changes, send back for revision - `"rejected"`: expense is denied, finish the process We can wire them like this: ```python # Define the flow connections review - "approved" >> payment # If approved, process payment review - "needs_revision" >> revise # If needs changes, go to revision review - "rejected" >> finish # If rejected, finish the process revise >> review # After revision, go back for another review payment >> finish # After payment, finish the process flow = Flow(start=review) ``` Let's see how it flows: 1. If `review.post()` returns `"approved"`, the expense moves to the `payment` node 2. If `review.post()` returns `"needs_revision"`, it goes to the `revise` node, which then loops back to `review` 3. If `review.post()` returns `"rejected"`, it moves to the `finish` node and stops ```mermaid flowchart TD review[Review Expense] -->|approved| payment[Process Payment] review -->|needs_revision| revise[Revise Report] review -->|rejected| finish[Finish Process] revise --> review payment --> finish ``` ### Running Individual Nodes vs. 
Running a Flow - `node.run(shared)`: Just runs that node alone (calls `prep->exec->post()`), returns an Action. - `flow.run(shared)`: Executes from the start node, follows Actions to the next node, and so on until the flow can't continue. > `node.run(shared)` **does not** proceed to the successor. > This is mainly for debugging or testing a single node. > > Always use `flow.run(...)` in production to ensure the full pipeline runs correctly. {: .warning } ## 3. Nested Flows A **Flow** can act like a Node, which enables powerful composition patterns. This means you can: 1. Use a Flow as a Node within another Flow's transitions. 2. Combine multiple smaller Flows into a larger Flow for reuse. 3. Node `params` will be a merging of **all** parents' `params`. ### Flow's Node Methods A **Flow** is also a **Node**, so it will run `prep()` and `post()`. However: - It **won't** run `exec()`, as its main logic is to orchestrate its nodes. - `post()` always receives `None` for `exec_res` and should instead get the flow execution results from the shared store. ### Basic Flow Nesting Here's how to connect a flow to another node: ```python # Create a sub-flow node_a >> node_b subflow = Flow(start=node_a) # Connect it to another node subflow >> node_c # Create the parent flow parent_flow = Flow(start=subflow) ``` When `parent_flow.run()` executes: 1. It starts `subflow` 2. `subflow` runs through its nodes (`node_a->node_b`) 3. 
After `subflow` completes, execution continues to `node_c` ### Example: Order Processing Pipeline Here's a practical example that breaks down order processing into nested flows: ```python # Payment processing sub-flow validate_payment >> process_payment >> payment_confirmation payment_flow = Flow(start=validate_payment) # Inventory sub-flow check_stock >> reserve_items >> update_inventory inventory_flow = Flow(start=check_stock) # Shipping sub-flow create_label >> assign_carrier >> schedule_pickup shipping_flow = Flow(start=create_label) # Connect the flows into a main order pipeline payment_flow >> inventory_flow >> shipping_flow # Create the master flow order_pipeline = Flow(start=payment_flow) # Run the entire pipeline order_pipeline.run(shared_data) ``` This creates a clean separation of concerns while maintaining a clear execution path: ```mermaid flowchart LR subgraph order_pipeline[Order Pipeline] subgraph paymentFlow["Payment Flow"] A[Validate Payment] --> B[Process Payment] --> C[Payment Confirmation] end subgraph inventoryFlow["Inventory Flow"] D[Check Stock] --> E[Reserve Items] --> F[Update Inventory] end subgraph shippingFlow["Shipping Flow"] G[Create Label] --> H[Assign Carrier] --> I[Schedule Pickup] end paymentFlow --> inventoryFlow inventoryFlow --> shippingFlow end ``` ================================================ File: docs/core_abstraction/node.md ================================================ --- layout: default title: "Node" parent: "Core Abstraction" nav_order: 1 --- # Node A **Node** is the smallest building block. Each Node has 3 steps `prep->exec->post`:
1. `prep(shared)`
   - **Read and preprocess data** from `shared` store.
   - Examples: *query DB, read files, or serialize data into a string*.
   - Return `prep_res`, which is used by `exec()` and `post()`.

2. `exec(prep_res)`
   - **Execute compute logic**, with optional retries and error handling (below).
   - Examples: *(mostly) LLM calls, remote APIs, tool use*.
   - ⚠️ This step is for compute only and must **NOT** access `shared`.
   - ⚠️ If retries are enabled, ensure the implementation is idempotent.
   - Return `exec_res`, which is passed to `post()`.

3. `post(shared, prep_res, exec_res)`
   - **Postprocess and write data** back to `shared`.
   - Examples: *update DB, change states, log results*.
   - **Decide the next action** by returning a *string* (`action = "default"` if *None*).

> **Why 3 steps?** To enforce the principle of *separation of concerns*. The data storage and data processing are operated separately.
>
> All steps are *optional*. E.g., you can implement only `prep` and `post` if you just need to process data.
{: .note }

### Fault Tolerance & Retries

You can **retry** `exec()` if it raises an exception, via two parameters set when defining the Node:

- `max_retries` (int): Max times to run `exec()`. The default is `1` (**no** retry).
- `wait` (int): The time to wait (in **seconds**) before the next retry. By default, `wait=0` (no waiting).

`wait` is helpful when you encounter rate limits or quota errors from your LLM provider and need to back off.

```python
my_node = SummarizeFile(max_retries=3, wait=10)
```

When an exception occurs in `exec()`, the Node automatically retries until:

- It either succeeds, or
- The Node has retried `max_retries - 1` times already and fails on the last attempt.

You can get the current retry count (0-based) from `self.cur_retry`.
```python
class RetryNode(Node):
    def exec(self, prep_res):
        print(f"Retry {self.cur_retry} times")
        raise Exception("Failed")
```

### Graceful Fallback

To **gracefully handle** the exception (after all retries) rather than raising it, override:

```python
def exec_fallback(self, prep_res, exc):
    raise exc
```

By default, it just re-raises the exception. But you can return a fallback result instead, which becomes the `exec_res` passed to `post()`.

### Example: Summarize file

```python
class SummarizeFile(Node):
    def prep(self, shared):
        return shared["data"]

    def exec(self, prep_res):
        if not prep_res:
            return "Empty file content"
        prompt = f"Summarize this text in 10 words: {prep_res}"
        summary = call_llm(prompt)  # might fail
        return summary

    def exec_fallback(self, prep_res, exc):
        # Provide a simple fallback instead of crashing
        return "There was an error processing your request."

    def post(self, shared, prep_res, exec_res):
        shared["summary"] = exec_res
        # Return "default" by not returning

summarize_node = SummarizeFile(max_retries=3)

# node.run() calls prep->exec->post
# If exec() fails, it is attempted up to 3 times in total before calling exec_fallback()
action_result = summarize_node.run(shared)

print("Action returned:", action_result)  # "default"
print("Summary stored:", shared["summary"])
```

================================================
File: docs/core_abstraction/parallel.md
================================================
---
layout: default
title: "(Advanced) Parallel"
parent: "Core Abstraction"
nav_order: 6
---

# (Advanced) Parallel

**Parallel** Nodes and Flows let you run multiple **Async** Nodes and Flows **concurrently**, for example, summarizing multiple texts at once. This can improve performance by overlapping I/O and compute.

> Because of Python’s GIL, parallel nodes and flows can’t truly parallelize CPU-bound tasks (e.g., heavy numerical computations). However, they excel at overlapping I/O-bound work such as LLM calls, database queries, API requests, or file I/O.
{: .warning } > - **Ensure Tasks Are Independent**: If each item depends on the output of a previous item, **do not** parallelize. > > - **Beware of Rate Limits**: Parallel calls can **quickly** trigger rate limits on LLM services. You may need a **throttling** mechanism (e.g., semaphores or sleep intervals). > > - **Consider Single-Node Batch APIs**: Some LLMs offer a **batch inference** API where you can send multiple prompts in a single call. This is more complex to implement but can be more efficient than launching many parallel requests and mitigates rate limits. {: .best-practice } ## AsyncParallelBatchNode Like **AsyncBatchNode**, but run `exec_async()` in **parallel**: ```python class ParallelSummaries(AsyncParallelBatchNode): async def prep_async(self, shared): # e.g., multiple texts return shared["texts"] async def exec_async(self, text): prompt = f"Summarize: {text}" return await call_llm_async(prompt) async def post_async(self, shared, prep_res, exec_res_list): shared["summary"] = "\n\n".join(exec_res_list) return "default" node = ParallelSummaries() flow = AsyncFlow(start=node) ``` ## AsyncParallelBatchFlow Parallel version of **BatchFlow**. Each iteration of the sub-flow runs **concurrently** using different parameters: ```python class SummarizeMultipleFiles(AsyncParallelBatchFlow): async def prep_async(self, shared): return [{"filename": f} for f in shared["files"]] sub_flow = AsyncFlow(start=LoadAndSummarizeFile()) parallel_flow = SummarizeMultipleFiles(start=sub_flow) await parallel_flow.run_async(shared) ``` ================================================ File: docs/design_pattern/agent.md ================================================ --- layout: default title: "Agent" parent: "Design Pattern" nav_order: 1 --- # Agent Agent is a powerful design pattern in which nodes can take dynamic actions based on the context.
## Implement Agent with Graph 1. **Context and Action:** Implement nodes that supply context and perform actions. 2. **Branching:** Use branching to connect each action node to an agent node. Use action to allow the agent to direct the [flow](../core_abstraction/flow.md) between nodes—and potentially loop back for multi-step. 3. **Agent Node:** Provide a prompt to decide action—for example: ```python f""" ### CONTEXT Task: {task_description} Previous Actions: {previous_actions} Current State: {current_state} ### ACTION SPACE [1] search Description: Use web search to get results Parameters: - query (str): What to search for [2] answer Description: Conclude based on the results Parameters: - result (str): Final answer to provide ### NEXT ACTION Decide the next action based on the current context and available action space. Return your response in the following format: ```yaml thinking: | action: parameters: : ```""" ``` The core of building **high-performance** and **reliable** agents boils down to: 1. **Context Management:** Provide *relevant, minimal context.* For example, rather than including an entire chat history, retrieve the most relevant via [RAG](./rag.md). Even with larger context windows, LLMs still fall victim to ["lost in the middle"](https://arxiv.org/abs/2307.03172), overlooking mid-prompt content. 2. **Action Space:** Provide *a well-structured and unambiguous* set of actions—avoiding overlap like separate `read_databases` or `read_csvs`. Instead, import CSVs into the database. ## Example Good Action Design - **Incremental:** Feed content in manageable chunks (500 lines or 1 page) instead of all at once. - **Overview-zoom-in:** First provide high-level structure (table of contents, summary), then allow drilling into details (raw texts). - **Parameterized/Programmable:** Instead of fixed actions, enable parameterized (columns to select) or programmable (SQL queries) actions, for example, to read CSV files. 
- **Backtracking:** Let the agent undo the last step instead of restarting entirely, preserving progress when encountering errors or dead ends. ## Example: Search Agent This agent: 1. Decides whether to search or answer 2. If searches, loops back to decide if more search needed 3. Answers when enough context gathered ```python class DecideAction(Node): def prep(self, shared): context = shared.get("context", "No previous search") query = shared["query"] return query, context def exec(self, inputs): query, context = inputs prompt = f""" Given input: {query} Previous search results: {context} Should I: 1) Search web for more info 2) Answer with current knowledge Output in yaml: ```yaml action: search/answer reason: why this action search_term: search phrase if action is search ```""" resp = call_llm(prompt) yaml_str = resp.split("```yaml")[1].split("```")[0].strip() result = yaml.safe_load(yaml_str) assert isinstance(result, dict) assert "action" in result assert "reason" in result assert result["action"] in ["search", "answer"] if result["action"] == "search": assert "search_term" in result return result def post(self, shared, prep_res, exec_res): if exec_res["action"] == "search": shared["search_term"] = exec_res["search_term"] return exec_res["action"] class SearchWeb(Node): def prep(self, shared): return shared["search_term"] def exec(self, search_term): return search_web(search_term) def post(self, shared, prep_res, exec_res): prev_searches = shared.get("context", []) shared["context"] = prev_searches + [ {"term": shared["search_term"], "result": exec_res} ] return "decide" class DirectAnswer(Node): def prep(self, shared): return shared["query"], shared.get("context", "") def exec(self, inputs): query, context = inputs return call_llm(f"Context: {context}\nAnswer: {query}") def post(self, shared, prep_res, exec_res): print(f"Answer: {exec_res}") shared["answer"] = exec_res # Connect nodes decide = DecideAction() search = SearchWeb() answer = DirectAnswer() decide 
- "search" >> search decide - "answer" >> answer search - "decide" >> decide # Loop back flow = Flow(start=decide) flow.run({"query": "Who won the Nobel Prize in Physics 2024?"}) ``` ================================================ File: docs/design_pattern/mapreduce.md ================================================ --- layout: default title: "Map Reduce" parent: "Design Pattern" nav_order: 4 --- # Map Reduce MapReduce is a design pattern suitable when you have either: - Large input data (e.g., multiple files to process), or - Large output data (e.g., multiple forms to fill) and there is a logical way to break the task into smaller, ideally independent parts.
You first break down the task using [BatchNode](../core_abstraction/batch.md) in the map phase, followed by aggregation in the reduce phase. ### Example: Document Summarization ```python class SummarizeAllFiles(BatchNode): def prep(self, shared): files_dict = shared["files"] # e.g. 10 files return list(files_dict.items()) # [("file1.txt", "aaa..."), ("file2.txt", "bbb..."), ...] def exec(self, one_file): filename, file_content = one_file summary_text = call_llm(f"Summarize the following file:\n{file_content}") return (filename, summary_text) def post(self, shared, prep_res, exec_res_list): shared["file_summaries"] = dict(exec_res_list) class CombineSummaries(Node): def prep(self, shared): return shared["file_summaries"] def exec(self, file_summaries): # format as: "File1: summary\nFile2: summary...\n" text_list = [] for fname, summ in file_summaries.items(): text_list.append(f"{fname} summary:\n{summ}\n") big_text = "\n---\n".join(text_list) return call_llm(f"Combine these file summaries into one final summary:\n{big_text}") def post(self, shared, prep_res, final_summary): shared["all_files_summary"] = final_summary batch_node = SummarizeAllFiles() combine_node = CombineSummaries() batch_node >> combine_node flow = Flow(start=batch_node) shared = { "files": { "file1.txt": "Alice was beginning to get very tired of sitting by her sister...", "file2.txt": "Some other interesting text ...", # ... } } flow.run(shared) print("Individual Summaries:", shared["file_summaries"]) print("\nFinal Summary:\n", shared["all_files_summary"]) ``` ================================================ File: docs/design_pattern/rag.md ================================================ --- layout: default title: "RAG" parent: "Design Pattern" nav_order: 3 --- # RAG (Retrieval Augmented Generation) For certain LLM tasks like answering questions, providing relevant context is essential. One common architecture is a **two-stage** RAG pipeline:
1. **Offline stage**: Preprocess and index documents ("building the index"). 2. **Online stage**: Given a question, generate answers by retrieving the most relevant context. --- ## Stage 1: Offline Indexing We create three Nodes: 1. `ChunkDocs` – [chunks](../utility_function/chunking.md) raw text. 2. `EmbedDocs` – [embeds](../utility_function/embedding.md) each chunk. 3. `StoreIndex` – stores embeddings into a [vector database](../utility_function/vector.md). ```python class ChunkDocs(BatchNode): def prep(self, shared): # A list of file paths in shared["files"]. We process each file. return shared["files"] def exec(self, filepath): # read file content. In real usage, do error handling. with open(filepath, "r", encoding="utf-8") as f: text = f.read() # chunk by 100 chars each chunks = [] size = 100 for i in range(0, len(text), size): chunks.append(text[i : i + size]) return chunks def post(self, shared, prep_res, exec_res_list): # exec_res_list is a list of chunk-lists, one per file. # flatten them all into a single list of chunks. all_chunks = [] for chunk_list in exec_res_list: all_chunks.extend(chunk_list) shared["all_chunks"] = all_chunks class EmbedDocs(BatchNode): def prep(self, shared): return shared["all_chunks"] def exec(self, chunk): return get_embedding(chunk) def post(self, shared, prep_res, exec_res_list): # Store the list of embeddings. shared["all_embeds"] = exec_res_list print(f"Total embeddings: {len(exec_res_list)}") class StoreIndex(Node): def prep(self, shared): # We'll read all embeds from shared. return shared["all_embeds"] def exec(self, all_embeds): # Create a vector index (faiss or other DB in real usage). 
index = create_index(all_embeds) return index def post(self, shared, prep_res, index): shared["index"] = index # Wire them in sequence chunk_node = ChunkDocs() embed_node = EmbedDocs() store_node = StoreIndex() chunk_node >> embed_node >> store_node OfflineFlow = Flow(start=chunk_node) ``` Usage example: ```python shared = { "files": ["doc1.txt", "doc2.txt"], # any text files } OfflineFlow.run(shared) ``` --- ## Stage 2: Online Query & Answer We have 3 nodes: 1. `EmbedQuery` – embeds the user’s question. 2. `RetrieveDocs` – retrieves top chunk from the index. 3. `GenerateAnswer` – calls the LLM with the question + chunk to produce the final answer. ```python class EmbedQuery(Node): def prep(self, shared): return shared["question"] def exec(self, question): return get_embedding(question) def post(self, shared, prep_res, q_emb): shared["q_emb"] = q_emb class RetrieveDocs(Node): def prep(self, shared): # We'll need the query embedding, plus the offline index/chunks return shared["q_emb"], shared["index"], shared["all_chunks"] def exec(self, inputs): q_emb, index, chunks = inputs I, D = search_index(index, q_emb, top_k=1) best_id = I[0][0] relevant_chunk = chunks[best_id] return relevant_chunk def post(self, shared, prep_res, relevant_chunk): shared["retrieved_chunk"] = relevant_chunk print("Retrieved chunk:", relevant_chunk[:60], "...") class GenerateAnswer(Node): def prep(self, shared): return shared["question"], shared["retrieved_chunk"] def exec(self, inputs): question, chunk = inputs prompt = f"Question: {question}\nContext: {chunk}\nAnswer:" return call_llm(prompt) def post(self, shared, prep_res, answer): shared["answer"] = answer print("Answer:", answer) embed_qnode = EmbedQuery() retrieve_node = RetrieveDocs() generate_node = GenerateAnswer() embed_qnode >> retrieve_node >> generate_node OnlineFlow = Flow(start=embed_qnode) ``` Usage example: ```python # Suppose we already ran OfflineFlow and have: # shared["all_chunks"], shared["index"], etc. 
shared["question"] = "Why do people like cats?" OnlineFlow.run(shared) # final answer in shared["answer"] ``` ================================================ File: docs/design_pattern/structure.md ================================================ --- layout: default title: "Structured Output" parent: "Design Pattern" nav_order: 5 --- # Structured Output In many use cases, you may want the LLM to output a specific structure, such as a list or a dictionary with predefined keys. There are several approaches to achieve a structured output: - **Prompting** the LLM to strictly return a defined structure. - Using LLMs that natively support **schema enforcement**. - **Post-processing** the LLM's response to extract structured content. In practice, **Prompting** is simple and reliable for modern LLMs. ### Example Use Cases - Extracting Key Information ```yaml product: name: Widget Pro price: 199.99 description: | A high-quality widget designed for professionals. Recommended for advanced users. ``` - Summarizing Documents into Bullet Points ```yaml summary: - This product is easy to use. - It is cost-effective. - Suitable for all skill levels. ``` - Generating Configuration Files ```yaml server: host: 127.0.0.1 port: 8080 ssl: true ``` ## Prompt Engineering When prompting the LLM to produce **structured** output: 1. **Wrap** the structure in code fences (e.g., `yaml`). 2. **Validate** that all required fields exist (and let `Node` handles retry). ### Example Text Summarization ```python class SummarizeNode(Node): def exec(self, prep_res): # Suppose `prep_res` is the text to summarize. 
        prompt = f"""
Please summarize the following text as YAML, with exactly 3 bullet points

{prep_res}

Now, output:
```yaml
summary:
  - bullet 1
  - bullet 2
  - bullet 3
```"""
        response = call_llm(prompt)
        yaml_str = response.split("```yaml")[1].split("```")[0].strip()

        import yaml
        structured_result = yaml.safe_load(yaml_str)

        assert "summary" in structured_result
        assert isinstance(structured_result["summary"], list)

        return structured_result
```

> Besides using `assert` statements, another popular way to validate schemas is [Pydantic](https://github.com/pydantic/pydantic).
{: .note }

### Why YAML instead of JSON?

Current LLMs struggle with escaping. YAML is easier with strings since they don't always need quotes.

**In JSON**

```json
{
  "dialogue": "Alice said: \"Hello Bob.\nHow are you?\nI am good.\""
}
```

- Every double quote inside the string must be escaped with `\"`.
- Each newline in the dialogue must be represented as `\n`.

**In YAML**

```yaml
dialogue: |
  Alice said: "Hello Bob.
  How are you?
  I am good."
```

- No need to escape interior quotes: just place the entire text under a block literal (`|`).
- Newlines are naturally preserved without needing `\n`.

================================================
File: docs/design_pattern/workflow.md
================================================
---
layout: default
title: "Workflow"
parent: "Design Pattern"
nav_order: 2
---

# Workflow

Many real-world tasks are too complex for one LLM call. The solution is **Task Decomposition**: decompose them into a [chain](../core_abstraction/flow.md) of multiple Nodes.
> - You don't want to make each task **too coarse**, because it may be *too complex for one LLM call*. > - You don't want to make each task **too granular**, because then *the LLM call doesn't have enough context* and results are *not consistent across nodes*. > > You usually need multiple *iterations* to find the *sweet spot*. If the task has too many *edge cases*, consider using [Agents](./agent.md). {: .best-practice } ### Example: Article Writing ```python class GenerateOutline(Node): def prep(self, shared): return shared["topic"] def exec(self, topic): return call_llm(f"Create a detailed outline for an article about {topic}") def post(self, shared, prep_res, exec_res): shared["outline"] = exec_res class WriteSection(Node): def prep(self, shared): return shared["outline"] def exec(self, outline): return call_llm(f"Write content based on this outline: {outline}") def post(self, shared, prep_res, exec_res): shared["draft"] = exec_res class ReviewAndRefine(Node): def prep(self, shared): return shared["draft"] def exec(self, draft): return call_llm(f"Review and improve this draft: {draft}") def post(self, shared, prep_res, exec_res): shared["final_article"] = exec_res # Connect nodes outline = GenerateOutline() write = WriteSection() review = ReviewAndRefine() outline >> write >> review # Create and run flow writing_flow = Flow(start=outline) shared = {"topic": "AI Safety"} writing_flow.run(shared) ``` For *dynamic cases*, consider using [Agents](./agent.md). ================================================ File: docs/utility_function/llm.md ================================================ --- layout: default title: "LLM Wrapper" parent: "Utility Function" nav_order: 1 --- # LLM Wrappers Check out libraries like [litellm](https://github.com/BerriAI/litellm). Here, we provide some minimal example implementations: 1. 
OpenAI
    ```python
    def call_llm(prompt):
        from openai import OpenAI
        client = OpenAI(api_key="YOUR_API_KEY_HERE")
        r = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return r.choices[0].message.content

    # Example usage
    call_llm("How are you?")
    ```
    > Store the API key in an environment variable like OPENAI_API_KEY for security.
    {: .best-practice }

2. Claude (Anthropic)
    ```python
    def call_llm(prompt):
        from anthropic import Anthropic
        client = Anthropic(api_key="YOUR_API_KEY_HERE")
        response = client.messages.create(
            model="claude-2",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        # response.content is a list of content blocks; return the first block's text
        return response.content[0].text
    ```

3. Google (Generative AI Studio / PaLM API)
    ```python
    def call_llm(prompt):
        import google.generativeai as genai
        genai.configure(api_key="YOUR_API_KEY_HERE")
        response = genai.generate_text(
            model="models/text-bison-001",
            prompt=prompt
        )
        return response.result
    ```

4. Azure (Azure OpenAI)
    ```python
    def call_llm(prompt):
        from openai import AzureOpenAI
        client = AzureOpenAI(
            azure_endpoint="https://.openai.azure.com/",
            api_key="YOUR_API_KEY_HERE",
            api_version="2023-05-15"
        )
        r = client.chat.completions.create(
            model="",
            messages=[{"role": "user", "content": prompt}]
        )
        return r.choices[0].message.content
    ```

5. Ollama (Local LLM)
    ```python
    def call_llm(prompt):
        from ollama import chat
        response = chat(
            model="llama2",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.message.content
    ```

## Improvements
Feel free to enhance your `call_llm` function as needed.
Here are examples:

- Handle chat history:

```python
def call_llm(messages):
    from openai import OpenAI
    client = OpenAI(api_key="YOUR_API_KEY_HERE")
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return r.choices[0].message.content
```

- Add in-memory caching:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def call_llm(prompt):
    # Your implementation here
    pass
```

> ⚠️ Caching conflicts with Node retries, as retries yield the same result.
>
> To address this, you could use cached results only if not retried.
{: .warning }

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_call(prompt):
    pass

def call_llm(prompt, use_cache):
    if use_cache:
        return cached_call(prompt)
    # Call the underlying function directly
    return cached_call.__wrapped__(prompt)

class SummarizeNode(Node):
    def exec(self, text):
        return call_llm(f"Summarize: {text}", self.cur_retry==0)
```

- Enable logging:

```python
def call_llm(prompt):
    import logging
    logging.info(f"Prompt: {prompt}")
    response = ... # Your implementation here
    logging.info(f"Response: {response}")
    return response
```

================================================
FILE: Dockerfile
================================================
FROM python:3.10-slim

# update packages, install git and remove cache
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
ENTRYPOINT ["python", "main.py"] ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2025 Zachary Huang Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================

# Turns Codebase into Easy Tutorial with AI

![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)

> *Ever stared at a new codebase written by others feeling completely lost? This tutorial shows you how to build an AI agent that analyzes GitHub repositories and creates beginner-friendly tutorials explaining exactly how the code works.*

This is a tutorial project of [Pocket Flow](https://github.com/The-Pocket/PocketFlow), a 100-line LLM framework. It crawls GitHub repositories and builds a knowledge base from the code. It analyzes entire codebases to identify core abstractions and how they interact, and transforms complex code into beginner-friendly tutorials with clear visualizations.

- Check out the [YouTube Development Tutorial](https://youtu.be/AFY67zOpbSo) for more!

- Check out the [Substack Post Tutorial](https://zacharyhuang.substack.com/p/ai-codebase-knowledge-builder-full) for more!

**🔸 🎉 Reached Hacker News Front Page** (April 2025) with >900 upvotes: [Discussion »](https://news.ycombinator.com/item?id=43739456)

**🔸 🎊 Online Service Now Live!** (May 2025) Try our new online version at [https://code2tutorial.com/](https://code2tutorial.com/). Just paste a GitHub link, no installation needed!

## ⭐ Example Results for Popular GitHub Repositories!

🤯 All these tutorials are generated **entirely by AI** by crawling the GitHub repo! - [AutoGen Core](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/AutoGen%20Core) - Build AI teams that talk, think, and solve problems together like coworkers! - [Browser Use](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Browser%20Use) - Let AI surf the web for you, clicking buttons and filling forms like a digital assistant! - [Celery](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Celery) - Supercharge your app with background tasks that run while you sleep! - [Click](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Click) - Turn Python functions into slick command-line tools with just a decorator! - [Codex](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Codex) - Turn plain English into working code with this AI terminal wizard! - [Crawl4AI](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Crawl4AI) - Train your AI to extract exactly what matters from any website! - [CrewAI](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/CrewAI) - Assemble a dream team of AI specialists to tackle impossible problems! - [DSPy](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/DSPy) - Build LLM apps like Lego blocks that optimize themselves! - [FastAPI](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/FastAPI) - Create APIs at lightning speed with automatic docs that clients will love! - [Flask](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Flask) - Craft web apps with minimal code that scales from prototype to production! - [Google A2A](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Google%20A2A) - The universal language that lets AI agents collaborate across borders! 
- [LangGraph](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/LangGraph) - Design AI agents as flowcharts where each step remembers what happened before! - [LevelDB](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/LevelDB) - Store data at warp speed with Google's engine that powers blockchains! - [MCP Python SDK](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/MCP%20Python%20SDK) - Build powerful apps that communicate through an elegant protocol without sweating the details! - [NumPy Core](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/NumPy%20Core) - Master the engine behind data science that makes Python as fast as C! - [OpenManus](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/OpenManus) - Build AI agents with digital brains that think, learn, and use tools just like humans do! - [PocketFlow](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/PocketFlow) - 100-line LLM framework. Let Agents build Agents! - [Pydantic Core](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Pydantic%20Core) - Validate data at rocket speed with just Python type hints! - [Requests](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/Requests) - Talk to the internet in Python with code so simple it feels like cheating! - [SmolaAgents](https://the-pocket.github.io/PocketFlow-Tutorial-Codebase-Knowledge/SmolaAgents) - Build tiny AI agents that punch way above their weight class! - Showcase Your AI-Generated Tutorials in [Discussions](https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge/discussions)! ## 🚀 Getting Started 1. Clone this repository ```bash git clone https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge ``` 3. Install dependencies: ```bash pip install -r requirements.txt ``` 4. Set up LLM in [`utils/call_llm.py`](./utils/call_llm.py) by providing credentials. 
To do so, you can put the values in a `.env` file. By default, you can use the AI Studio key with this client for Gemini Pro 2.5 by setting the `GEMINI_API_KEY` environment variable. If you want to use another LLM, you can set the `LLM_PROVIDER` environment variable (e.g. `XAI`), and then set the model, url, and API key (e.g. `XAI_MODEL`, `XAI_URL`,`XAI_API_KEY`). If using Ollama, the url is `http://localhost:11434/` and the API key can be omitted. You can use your own models. We highly recommend the latest models with thinking capabilities (Claude 3.7 with thinking, O1). You can verify that it is correctly set up by running: ```bash python utils/call_llm.py ``` 5. Generate a complete codebase tutorial by running the main script: ```bash # Analyze a GitHub repository python main.py --repo https://github.com/username/repo --include "*.py" "*.js" --exclude "tests/*" --max-size 50000 # Or, analyze a local directory python main.py --dir /path/to/your/codebase --include "*.py" --exclude "*test*" # Or, generate a tutorial in Chinese python main.py --repo https://github.com/username/repo --language "Chinese" ``` - `--repo` or `--dir` - Specify either a GitHub repo URL or a local directory path (required, mutually exclusive) - `-n, --name` - Project name (optional, derived from URL/directory if omitted) - `-t, --token` - GitHub token (or set GITHUB_TOKEN environment variable) - `-o, --output` - Output directory (default: ./output) - `-i, --include` - Files to include (e.g., "`*.py`" "`*.js`") - `-e, --exclude` - Files to exclude (e.g., "`tests/*`" "`docs/*`") - `-s, --max-size` - Maximum file size in bytes (default: 100KB) - `--language` - Language for the generated tutorial (default: "english") - `--max-abstractions` - Maximum number of abstractions to identify (default: 10) - `--no-cache` - Disable LLM response caching (default: caching enabled) The application will crawl the repository, analyze the codebase structure, generate tutorial content in the specified language, 
and save the output in the specified directory (default: ./output).
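
The `--include`, `--exclude`, and `--max-size` options can be pictured as a filter applied to every candidate file. The sketch below only illustrates that idea and is not the project's actual crawler code: `should_include` is a hypothetical helper, and it assumes shell-style matching through Python's `fnmatch`, which may differ from how `main.py` interprets patterns.

```python
from fnmatch import fnmatch

def should_include(path, size, include=("*.py",), exclude=("tests/*",), max_size=100_000):
    """Hypothetical helper: decide whether a file would be analyzed."""
    if size > max_size:
        return False  # too large, skipped regardless of patterns
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False  # an exclude pattern wins over any include pattern
    return any(fnmatch(path, pattern) for pattern in include)

# Made-up file listing: path -> size in bytes
files = {
    "src/app.py": 2_000,
    "tests/test_app.py": 1_500,
    "README.md": 800,
    "src/big.py": 250_000,
}
selected = [path for path, size in files.items() if should_include(path, size)]
print(selected)
```

In this sketch the exclude check runs before the include check, so a file matching both is dropped. That ordering is a design assumption of the example, not documented behavior of the tool.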
## 🐳 Running with Docker

To run this project in a Docker container, you'll need to pass your API keys as environment variables.

1. Build the Docker image
    ```bash
    docker build -t pocketflow-app .
    ```

2. Run the container

    You'll need to provide your `GEMINI_API_KEY` for the LLM to function. If you're analyzing private GitHub repositories or want to avoid rate limits, also provide your `GITHUB_TOKEN`.

    Mount a local directory to `/app/output` inside the container to access the generated tutorials on your host machine.

    **Example for analyzing a public GitHub repository:**
    ```bash
    docker run -it --rm \
      -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
      -v "$(pwd)/output_tutorials":/app/output \
      pocketflow-app --repo https://github.com/username/repo
    ```

    **Example for analyzing a local directory:**
    ```bash
    docker run -it --rm \
      -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
      -v "/path/to/your/local_codebase":/app/code_to_analyze \
      -v "$(pwd)/output_tutorials":/app/output \
      pocketflow-app --dir /app/code_to_analyze
    ```
## 💡 Development Tutorial

- I built this using [**Agentic Coding**](https://zacharyhuang.substack.com/p/agentic-coding-the-most-fun-way-to), the fastest development paradigm, where humans simply [design](docs/design.md) and agents [code](flow.py).

- The secret weapon is [Pocket Flow](https://github.com/The-Pocket/PocketFlow), a 100-line LLM framework that lets Agents (e.g., Cursor AI) build for you.

- Check out the step-by-step [YouTube development tutorial](https://youtu.be/AFY67zOpbSo).

================================================ FILE: docs/AutoGen Core/01_agent.md ================================================ --- layout: default title: "Agent" parent: "AutoGen Core" nav_order: 1 --- # Chapter 1: Agent - The Workers of AutoGen Welcome to the AutoGen Core tutorial! We're excited to guide you through building powerful applications with autonomous agents. ## Motivation: Why Do We Need Agents? Imagine you want to build an automated system to write blog posts. You might need one part of the system to research a topic and another part to write the actual post based on the research. How do you represent these different "workers" and make them talk to each other? This is where the concept of an **Agent** comes in. In AutoGen Core, an `Agent` is the fundamental building block representing an actor or worker in your system. Think of it like an employee in an office. ## Key Concepts: Understanding Agents Let's break down what makes an Agent: 1. **It's a Worker:** An Agent is designed to *do* things. This could be running calculations, calling a Large Language Model (LLM) like ChatGPT, using a tool (like a search engine), or managing a piece of data. 2. **It Has an Identity (`AgentId`):** Just like every employee has a name and a job title, every Agent needs a unique identity. This identity, called `AgentId`, has two parts: * `type`: What kind of role does the agent have? (e.g., "researcher", "writer", "coder"). This helps organize agents. * `key`: A unique name for this specific agent instance (e.g., "researcher-01", "amy-the-writer"). ```python # From: _agent_id.py class AgentId: def __init__(self, type: str, key: str) -> None: # ... 
(validation checks omitted for brevity) self._type = type self._key = key @property def type(self) -> str: return self._type @property def key(self) -> str: return self._key def __str__(self) -> str: # Creates an id like "researcher/amy-the-writer" return f"{self._type}/{self._key}" ``` This `AgentId` acts like the agent's address, allowing other agents (or the system) to send messages specifically to it. 3. **It Has Metadata (`AgentMetadata`):** Besides its core identity, an agent often has descriptive information. * `type`: Same as in `AgentId`. * `key`: Same as in `AgentId`. * `description`: A human-readable explanation of what the agent does (e.g., "Researches topics using web search"). ```python # From: _agent_metadata.py from typing import TypedDict class AgentMetadata(TypedDict): type: str key: str description: str ``` This metadata helps understand the agent's purpose within the system. 4. **It Communicates via Messages:** Agents don't work in isolation. They collaborate by sending and receiving messages. The primary way an agent receives work is through its `on_message` method. Think of this like the agent's inbox. ```python # From: _agent.py (Simplified Agent Protocol) from typing import Any, Mapping, Protocol # ... other imports class Agent(Protocol): @property def id(self) -> AgentId: ... # The agent's unique ID async def on_message(self, message: Any, ctx: MessageContext) -> Any: """Handles an incoming message.""" # Agent's logic to process the message goes here ... ``` When an agent receives a message, `on_message` is called. The `message` contains the data or task, and `ctx` (MessageContext) provides extra information about the message (like who sent it). We'll cover `MessageContext` more later. 5. **It Can Remember Things (State):** Sometimes, an agent needs to remember information between tasks, like keeping notes on research progress. Agents can optionally implement `save_state` and `load_state` methods to store and retrieve their internal memory. 
```python # From: _agent.py (Simplified Agent Protocol) class Agent(Protocol): # ... other methods async def save_state(self) -> Mapping[str, Any]: """Save the agent's internal memory.""" # Return a dictionary representing the state ... async def load_state(self, state: Mapping[str, Any]) -> None: """Load the agent's internal memory.""" # Restore state from the dictionary ... ``` We'll explore state and memory in more detail in [Chapter 7: Memory](07_memory.md). 6. **Different Agent Types:** AutoGen Core provides base classes to make creating agents easier: * `BaseAgent`: The fundamental class most agents inherit from. It provides common setup. * `ClosureAgent`: A very quick way to create simple agents using just a function (like hiring a temp worker for a specific task defined on the spot). * `RoutedAgent`: An agent that can automatically direct different types of messages to different internal handler methods (like a smart receptionist). ## Use Case Example: Researcher and Writer Let's revisit our blog post example. We want a `Researcher` agent and a `Writer` agent. **Goal:** 1. Tell the `Researcher` a topic (e.g., "AutoGen Agents"). 2. The `Researcher` finds some facts (we'll keep it simple and just make them up for now). 3. The `Researcher` sends these facts to the `Writer`. 4. The `Writer` receives the facts and drafts a short post. **Simplified Implementation Idea (using `ClosureAgent` for brevity):** First, let's define the messages they might exchange: ```python from dataclasses import dataclass @dataclass class ResearchTopic: topic: str @dataclass class ResearchFacts: topic: str facts: list[str] @dataclass class DraftPost: topic: str draft: str ``` These are simple Python classes to hold the data being passed around. Now, let's imagine defining the `Researcher` using a `ClosureAgent`. This agent will listen for `ResearchTopic` messages. 
```python # Simplified concept - requires AgentRuntime (Chapter 3) to actually run async def researcher_logic(agent_context, message: ResearchTopic, msg_context): print(f"Researcher received topic: {message.topic}") # In a real scenario, this would involve searching, calling an LLM, etc. # For now, we just make up facts. facts = [f"Fact 1 about {message.topic}", f"Fact 2 about {message.topic}"] print(f"Researcher found facts: {facts}") # Find the Writer agent's ID (we assume we know it) writer_id = AgentId(type="writer", key="blog_writer_1") # Send the facts to the Writer await agent_context.send_message( message=ResearchFacts(topic=message.topic, facts=facts), recipient=writer_id, ) print("Researcher sent facts to Writer.") # This agent doesn't return a direct reply return None ``` This `researcher_logic` function defines *what* the researcher does when it gets a `ResearchTopic` message. It processes the topic, creates `ResearchFacts`, and uses `agent_context.send_message` to send them to the `writer` agent. Similarly, the `Writer` agent would have its own logic: ```python # Simplified concept - requires AgentRuntime (Chapter 3) to actually run async def writer_logic(agent_context, message: ResearchFacts, msg_context): print(f"Writer received facts for topic: {message.topic}") # In a real scenario, this would involve LLM prompting draft = f"Blog Post about {message.topic}:\n" for fact in message.facts: draft += f"- {fact}\n" print(f"Writer drafted post:\n{draft}") # Perhaps save the draft or send it somewhere else # For now, we just print it. We don't send another message. return None # Or maybe return a confirmation/result ``` This `writer_logic` function defines how the writer reacts to receiving `ResearchFacts`. 
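
To see the two pieces of logic cooperate without the real runtime, here is a minimal, self-contained sketch. `ToyRuntime` is a hypothetical stand-in invented for this example (it is not part of AutoGen Core), agents are addressed by a plain string instead of an `AgentId`, and the researcher returns the writer's draft only so the result is easy to inspect.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ResearchTopic:
    topic: str

@dataclass
class ResearchFacts:
    topic: str
    facts: list[str]

class ToyRuntime:
    """Hypothetical stand-in for AgentRuntime: maps an agent name to a handler."""
    def __init__(self):
        self.handlers = {}
        self.log = []  # (recipient, message type) for every delivery

    def register(self, name, handler):
        self.handlers[name] = handler

    async def send_message(self, message, recipient):
        self.log.append((recipient, type(message).__name__))
        return await self.handlers[recipient](self, message)

async def researcher(runtime, message: ResearchTopic):
    # Made-up facts, as in the chapter's example
    facts = [f"Fact 1 about {message.topic}", f"Fact 2 about {message.topic}"]
    return await runtime.send_message(ResearchFacts(message.topic, facts), "writer")

async def writer(runtime, message: ResearchFacts):
    lines = [f"Blog Post about {message.topic}:"]
    lines += [f"- {fact}" for fact in message.facts]
    return "\n".join(lines)

async def main():
    runtime = ToyRuntime()
    runtime.register("researcher", researcher)
    runtime.register("writer", writer)
    draft = await runtime.send_message(ResearchTopic("AutoGen Agents"), "researcher")
    return runtime, draft

runtime, draft = asyncio.run(main())
print(draft)
```

The delivery log shows the same hand-off described above: the topic goes to the researcher first, then the facts go to the writer.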
**Important:** To actually *run* these agents and make them communicate, we need the `AgentRuntime` (covered in [Chapter 3: AgentRuntime](03_agentruntime.md)) and the `Messaging System` (covered in [Chapter 2: Messaging System](02_messaging_system__topic___subscription_.md)). For now, focus on the *idea* that Agents are distinct workers defined by their logic (`on_message`) and identified by their `AgentId`. ## Under the Hood: How an Agent Gets a Message While the full message delivery involves the `Messaging System` and `AgentRuntime`, let's look at the agent's role when it receives a message. **Conceptual Flow:** ```mermaid sequenceDiagram participant Sender as Sender Agent participant Runtime as AgentRuntime participant Recipient as Recipient Agent Sender->>+Runtime: send_message(message, recipient_id) Runtime->>+Recipient: Locate agent by recipient_id Runtime->>+Recipient: on_message(message, context) Recipient->>Recipient: Process message using internal logic alt Response Needed Recipient->>-Runtime: Return response value Runtime->>-Sender: Deliver response value else No Response Recipient->>-Runtime: Return None (or no return) end ``` 1. Some other agent (Sender) or the system decides to send a message to our agent (Recipient). 2. It tells the `AgentRuntime` (the manager): "Deliver this `message` to the agent with `recipient_id`". 3. The `AgentRuntime` finds the correct `Recipient` agent instance. 4. The `AgentRuntime` calls the `Recipient.on_message(message, context)` method. 5. The agent's internal logic inside `on_message` (or methods called by it, like in `RoutedAgent`) runs to process the message. 6. If the message requires a direct response (like an RPC call), the agent returns a value from `on_message`. If not (like a general notification or event), it might return `None`. **Code Glimpse:** The core definition is the `Agent` Protocol (`_agent.py`). It's like an interface or a contract – any class wanting to be an Agent *must* provide these methods. 
```python # From: _agent.py - The Agent blueprint (Protocol) @runtime_checkable class Agent(Protocol): @property def metadata(self) -> AgentMetadata: ... @property def id(self) -> AgentId: ... async def on_message(self, message: Any, ctx: MessageContext) -> Any: ... async def save_state(self) -> Mapping[str, Any]: ... async def load_state(self, state: Mapping[str, Any]) -> None: ... async def close(self) -> None: ... ``` Most agents you create will inherit from `BaseAgent` (`_base_agent.py`). It provides some standard setup: ```python # From: _base_agent.py (Simplified) class BaseAgent(ABC, Agent): def __init__(self, description: str) -> None: # Gets runtime & id from a special context when created by the runtime # Raises error if you try to create it directly! self._runtime: AgentRuntime = AgentInstantiationContext.current_runtime() self._id: AgentId = AgentInstantiationContext.current_agent_id() self._description = description # ... # This is the final version called by the runtime @final async def on_message(self, message: Any, ctx: MessageContext) -> Any: # It calls the implementation method you need to write return await self.on_message_impl(message, ctx) # You MUST implement this in your subclass @abstractmethod async def on_message_impl(self, message: Any, ctx: MessageContext) -> Any: ... # Helper to send messages easily async def send_message(self, message: Any, recipient: AgentId, ...) -> Any: # It just asks the runtime to do the actual sending return await self._runtime.send_message( message, sender=self.id, recipient=recipient, ... ) # ... other methods like publish_message, save_state, load_state ``` Notice how `BaseAgent` handles getting its `id` and `runtime` during creation and provides a convenient `send_message` method that uses the runtime. When inheriting from `BaseAgent`, you primarily focus on implementing the `on_message_impl` method to define your agent's unique behavior. 
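
The split between a fixed `on_message` and a subclass-provided `on_message_impl` is a template-method pattern. The sketch below reproduces just that pattern in plain Python so it can run without AutoGen installed. `ToyBaseAgent` and `EchoAgent` are hypothetical names, and the runtime, `AgentId`, and `MessageContext` are deliberately left out.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, final

class ToyBaseAgent(ABC):
    """Hypothetical simplification of the BaseAgent pattern (no runtime involved)."""
    def __init__(self, description: str) -> None:
        self._description = description

    @final
    async def on_message(self, message: Any) -> Any:
        # Fixed entry point: subclasses are not meant to override this.
        return await self.on_message_impl(message)

    @abstractmethod
    async def on_message_impl(self, message: Any) -> Any:
        """Subclasses put their behavior here."""
        ...

class EchoAgent(ToyBaseAgent):
    async def on_message_impl(self, message: Any) -> Any:
        return f"{self._description} got: {message}"

reply = asyncio.run(EchoAgent("echo").on_message("hello"))
print(reply)
```

Because `on_message_impl` is abstract, instantiating `ToyBaseAgent` directly raises a `TypeError`, which mirrors the "you MUST implement this" comment in the simplified `BaseAgent` listing.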
## Next Steps You now understand the core concept of an `Agent` in AutoGen Core! It's the fundamental worker unit with an identity, the ability to process messages, and optionally maintain state. In the next chapters, we'll explore: * [Chapter 2: Messaging System](02_messaging_system__topic___subscription_.md): How messages actually travel between agents. * [Chapter 3: AgentRuntime](03_agentruntime.md): The manager responsible for creating, running, and connecting agents. Let's continue building your understanding! --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/AutoGen Core/02_messaging_system__topic___subscription_.md ================================================ --- layout: default title: "Messaging System" parent: "AutoGen Core" nav_order: 2 --- # Chapter 2: Messaging System (Topic & Subscription) In [Chapter 1: Agent](01_agent.md), we learned about Agents as individual workers. But how do they coordinate when one agent doesn't know exactly *who* needs the information it produces? Imagine our Researcher finds some facts. Maybe the Writer needs them, but maybe a Fact-Checker agent or a Summary agent also needs them later. How can the Researcher just announce "Here are the facts!" without needing a specific mailing list? This is where the **Messaging System**, specifically **Topics** and **Subscriptions**, comes in. It allows agents to broadcast messages to anyone interested, like posting on a company announcement board. ## Motivation: Broadcasting Information Let's refine our blog post example: 1. The `Researcher` agent finds facts about "AutoGen Agents". 2. Instead of sending *directly* to the `Writer`, the `Researcher` **publishes** these facts to a general "research-results" **Topic**. 3. The `Writer` agent has previously told the system it's **subscribed** to the "research-results" Topic. 4. 
The system sees the new message on the Topic and delivers it to the `Writer` (and any other subscribers). This way, the `Researcher` doesn't need to know who the `Writer` is, or even if a `Writer` exists! It just broadcasts the results. If we later add a `FactChecker` agent that also needs the results, it simply subscribes to the same Topic. ## Key Concepts: Topics and Subscriptions Let's break down the components of this broadcasting system: 1. **Topic (`TopicId`): The Announcement Board** * A `TopicId` represents a specific channel or category for messages. Think of it like the name of an announcement board (e.g., "Project Updates", "General Announcements"). * It has two main parts: * `type`: What *kind* of event or information is this? (e.g., "research.completed", "user.request"). This helps categorize messages. * `source`: *Where* or *why* did this event originate? Often, this relates to the specific task or context (e.g., the specific blog post being researched like "autogen-agents-blog-post", or the team generating the event like "research-team"). ```python # From: _topic.py (Simplified) from dataclasses import dataclass @dataclass(frozen=True) # Immutable: can't change after creation class TopicId: type: str source: str def __str__(self) -> str: # Creates an id like "research.completed/autogen-agents-blog-post" return f"{self.type}/{self.source}" ``` This structure allows for flexible filtering. Agents might subscribe to all topics of a certain `type`, regardless of the `source`, or only to topics with a specific `source`. 2. **Publishing: Posting the Announcement** * When an agent has information to share broadly, it *publishes* a message to a specific `TopicId`. * This is like pinning a note to the designated announcement board. The agent doesn't need to know who will read it. 3. **Subscription (`Subscription`): Signing Up for Updates** * A `Subscription` is how an agent declares its interest in certain `TopicId`s. 
* It acts like a rule: "If a message is published to a Topic that matches *this pattern*, please deliver it to *this kind of agent*". * The `Subscription` links a `TopicId` pattern (e.g., "all topics with type `research.completed`") to an `AgentId` (or a way to determine the `AgentId`). 4. **Routing: Delivering the Mail** * The `AgentRuntime` (the system manager we'll meet in [Chapter 3: AgentRuntime](03_agentruntime.md)) keeps track of all active `Subscription`s. * When a message is published to a `TopicId`, the `AgentRuntime` checks which `Subscription`s match that `TopicId`. * For each match, it uses the `Subscription`'s rule to figure out which specific `AgentId` should receive the message and delivers it. ## Use Case Example: Researcher Publishes, Writer Subscribes Let's see how our Researcher and Writer can use this system. **Goal:** Researcher publishes facts to a topic, Writer receives them via subscription. **1. Define the Topic:** We need a `TopicId` for research results. Let's say the `type` is "research.facts.available" and the `source` identifies the specific research task (e.g., "blog-post-autogen"). ```python # From: _topic.py from autogen_core import TopicId # Define the topic for this specific research task research_topic_id = TopicId(type="research.facts.available", source="blog-post-autogen") print(f"Topic ID: {research_topic_id}") # Output: Topic ID: research.facts.available/blog-post-autogen ``` This defines the "announcement board" we'll use. **2. Researcher Publishes:** The `Researcher` agent, after finding facts, will use its `agent_context` (provided by the runtime) to publish the `ResearchFacts` message to this topic. 
```python # Simplified concept - Researcher agent logic # Assume 'agent_context' and 'message' (ResearchTopic) are provided # Define the facts message (from Chapter 1) @dataclass class ResearchFacts: topic: str facts: list[str] async def researcher_publish_logic(agent_context, message: ResearchTopic, msg_context): print(f"Researcher working on: {message.topic}") facts_data = ResearchFacts( topic=message.topic, facts=[f"Fact A about {message.topic}", f"Fact B about {message.topic}"] ) # Define the specific topic for this task's results results_topic = TopicId(type="research.facts.available", source=message.topic) # Use message topic as source # Publish the facts to the topic await agent_context.publish_message(message=facts_data, topic_id=results_topic) print(f"Researcher published facts to topic: {results_topic}") # No direct reply needed return None ``` Notice the `agent_context.publish_message` call. The Researcher doesn't specify a recipient, only the topic. **3. Writer Subscribes:** The `Writer` agent needs to tell the system it's interested in messages on topics like "research.facts.available". We can use a predefined `Subscription` type called `TypeSubscription`. This subscription typically means: "I am interested in all topics with this *exact type*. When a message arrives, create/use an agent of *my type* whose `key` matches the topic's `source`." ```python # From: _type_subscription.py (Simplified Concept) from autogen_core import TypeSubscription, BaseAgent class WriterAgent(BaseAgent): # ... agent implementation ... async def on_message_impl(self, message: ResearchFacts, ctx): # This method gets called when a subscribed message arrives print(f"Writer ({self.id}) received facts via subscription: {message.facts}") # ... process facts and write draft ... 
# How the Writer subscribes (usually done during runtime setup - Chapter 3) # This tells the runtime: "Messages on topics with type 'research.facts.available' # should go to a 'writer' agent whose key matches the topic source." writer_subscription = TypeSubscription( topic_type="research.facts.available", agent_type="writer" # The type of agent that should handle this ) print(f"Writer subscription created for topic type: {writer_subscription.topic_type}") # Output: Writer subscription created for topic type: research.facts.available ``` When the `Researcher` publishes to `TopicId(type="research.facts.available", source="blog-post-autogen")`, the `AgentRuntime` will see that `writer_subscription` matches the `topic_type`. It will then use the rule: "Find (or create) an agent with `AgentId(type='writer', key='blog-post-autogen')` and deliver the message." **Benefit:** Decoupling! The Researcher just broadcasts. The Writer just listens for relevant broadcasts. We can add more listeners (like a `FactChecker` subscribing to the same `topic_type`) without changing the `Researcher` at all. ## Under the Hood: How Publishing Works Let's trace the journey of a published message. **Conceptual Flow:** ```mermaid sequenceDiagram participant Publisher as Publisher Agent participant Runtime as AgentRuntime participant SubRegistry as Subscription Registry participant Subscriber as Subscriber Agent Publisher->>+Runtime: publish_message(message, topic_id) Runtime->>+SubRegistry: Find subscriptions matching topic_id SubRegistry-->>-Runtime: Return list of matching Subscriptions loop For each matching Subscription Runtime->>Subscription: map_to_agent(topic_id) Subscription-->>Runtime: Return target AgentId Runtime->>+Subscriber: Locate/Create Agent instance by AgentId Runtime->>Subscriber: on_message(message, context) Subscriber-->>-Runtime: Process message (optional return) end Runtime-->>-Publisher: Return (usually None for publish) ``` 1. 
**Publish:** An agent calls `agent_context.publish_message(message, topic_id)`. This internally calls the `AgentRuntime`'s publish method. 2. **Lookup:** The `AgentRuntime` takes the `topic_id` and consults its internal `Subscription Registry`. 3. **Match:** The Registry checks all registered `Subscription` objects. Each `Subscription` has an `is_match(topic_id)` method. The registry finds all subscriptions where `is_match` returns `True`. 4. **Map:** For each matching `Subscription`, the Runtime calls its `map_to_agent(topic_id)` method. This method returns the specific `AgentId` that should handle this message based on the subscription rule and the topic details. 5. **Deliver:** The `AgentRuntime` finds the agent instance corresponding to the returned `AgentId` (potentially creating it if it doesn't exist yet, especially with `TypeSubscription`). It then calls that agent's `on_message` method, delivering the original published `message`. **Code Glimpse:** * **`TopicId` (`_topic.py`):** As shown before, a simple dataclass holding `type` and `source`. It includes validation to ensure the `type` follows certain naming conventions. ```python # From: _topic.py @dataclass(eq=True, frozen=True) class TopicId: type: str source: str # ... validation and __str__ ... @classmethod def from_str(cls, topic_id: str) -> Self: # Helper to parse "type/source" string # ... implementation ... ``` * **`Subscription` Protocol (`_subscription.py`):** This defines the *contract* for any subscription rule. ```python # From: _subscription.py (Simplified Protocol) from typing import Protocol # ... other imports class Subscription(Protocol): @property def id(self) -> str: ... # Unique ID for this subscription instance def is_match(self, topic_id: TopicId) -> bool: """Check if a topic matches this subscription's rule.""" ... def map_to_agent(self, topic_id: TopicId) -> AgentId: """Determine the target AgentId if is_match was True.""" ... 
``` Any class implementing these methods can act as a subscription rule. * **`TypeSubscription` (`_type_subscription.py`):** A common implementation of the `Subscription` protocol. ```python # From: _type_subscription.py (Simplified) class TypeSubscription(Subscription): def __init__(self, topic_type: str, agent_type: str, ...): self._topic_type = topic_type self._agent_type = agent_type # ... generates a unique self._id ... def is_match(self, topic_id: TopicId) -> bool: # Matches if the topic's type is exactly the one we want return topic_id.type == self._topic_type def map_to_agent(self, topic_id: TopicId) -> AgentId: # Maps to an agent of the specified type, using the # topic's source as the agent's unique key. if not self.is_match(topic_id): raise CantHandleException(...) # Should not happen if used correctly return AgentId(type=self._agent_type, key=topic_id.source) # ... id property ... ``` This implementation provides the "one agent instance per source" behavior for a specific topic type. * **`DefaultSubscription` (`_default_subscription.py`):** This is often used via a decorator (`@default_subscription`) and provides a convenient way to create a `TypeSubscription` where the `agent_type` is automatically inferred from the agent class being defined, and the `topic_type` defaults to "default" (but can be overridden). It simplifies common use cases. Note that `ResearchFacts` below is our own message dataclass defined earlier in this chapter, not something exported by `autogen_core`. ```python # From: _default_subscription.py (Conceptual Usage) from autogen_core import BaseAgent, default_subscription @default_subscription # Uses 'default' topic type, infers agent type 'writer' class WriterAgent(BaseAgent): # Agent logic here... async def on_message_impl(self, message: ResearchFacts, ctx): ... # Or specify the topic type @default_subscription(topic_type="research.facts.available") class SpecificWriterAgent(BaseAgent): # Agent logic here... async def on_message_impl(self, message: ResearchFacts, ctx): ...
``` The actual sending (`publish_message`) and routing logic reside within the `AgentRuntime`, which we'll explore next. ## Next Steps You've learned how AutoGen Core uses a publish/subscribe system (`TopicId`, `Subscription`) to allow agents to communicate without direct coupling. This is crucial for building flexible and scalable multi-agent applications. * **Topic (`TopicId`):** Named channels (`type`/`source`) for broadcasting messages. * **Publish:** Sending a message to a Topic. * **Subscription:** An agent's declared interest in messages on certain Topics, defining a routing rule. Now, let's dive into the orchestrator that manages agents and makes this messaging system work: * [Chapter 3: AgentRuntime](03_agentruntime.md): The manager responsible for creating, running, and connecting agents, including handling message publishing and subscription routing. --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/AutoGen Core/03_agentruntime.md ================================================ --- layout: default title: "AgentRuntime" parent: "AutoGen Core" nav_order: 3 --- # Chapter 3: AgentRuntime - The Office Manager In [Chapter 1: Agent](01_agent.md), we met the workers (`Agent`) of our system. In [Chapter 2: Messaging System](02_messaging_system__topic___subscription_.md), we saw how they can communicate broadly using topics and subscriptions. But who hires these agents? Who actually delivers the messages, whether direct or published? And who keeps the whole system running smoothly? This is where the **`AgentRuntime`** comes in. It's the central nervous system, the operating system, or perhaps the most fitting analogy: **the office manager** for all your agents. ## Motivation: Why Do We Need an Office Manager? Imagine an office full of employees (Agents). You have researchers, writers, maybe coders. * How does a new employee get hired and set up? 
* When one employee wants to send a memo directly to another, who makes sure it gets to the right desk?
* When someone posts an announcement on the company bulletin board (publishes to a topic), who ensures everyone who signed up for that type of announcement sees it?
* Who starts the workday and ensures everything keeps running?

Without an office manager, it would be chaos! The `AgentRuntime` serves this crucial role in AutoGen Core. It handles:

1. **Agent Creation:** "Onboarding" new agents when they are needed.
2. **Message Routing:** Delivering direct messages (`send_message`) and published messages (`publish_message`).
3. **Lifecycle Management:** Starting, running, and stopping the whole system.
4. **State Management:** Keeping track of the overall system state (optional).

## Key Concepts: Understanding the Manager's Job

Let's break down the main responsibilities of the `AgentRuntime`:

1. **Agent Instantiation (Hiring):**
    * You don't usually create agent objects directly (like `my_agent = ResearcherAgent()`). Why? Because the agent needs to know *about* the runtime (the office it works in) to send messages, publish announcements, etc.
    * Instead, you tell the `AgentRuntime`: "I need an agent of type 'researcher'. Here's a recipe (a **factory function**) for how to create one." This is done using `runtime.register_factory(...)`.
    * When a message needs to go to a 'researcher' agent with a specific key (e.g., 'researcher-01'), the runtime checks if it already exists. If not, it uses the registered factory function to create (instantiate) the agent.
    * **Crucially**, while creating the agent, the runtime provides special context (`AgentInstantiationContext`) so the new agent automatically gets its unique `AgentId` and a reference to the `AgentRuntime` itself. This is like giving a new employee their ID badge and telling them who the office manager is.
```python
# Simplified Concept - How a BaseAgent gets its ID and runtime access
# From: _agent_instantiation.py and _base_agent.py

# Inside the agent's __init__ method (when inheriting from BaseAgent):
class MyAgent(BaseAgent):
    def __init__(self, description: str):
        # This magic happens *because* the AgentRuntime is creating the agent
        # inside a special context.
        self._runtime = AgentInstantiationContext.current_runtime() # Gets the manager
        self._id = AgentInstantiationContext.current_agent_id() # Gets its own ID
        self._description = description
        # ... rest of initialization ...
```

This ensures agents are properly integrated into the system from the moment they are created.

2. **Message Delivery (Mail Room):**
    * **Direct Send (`send_message`):** When an agent calls `await agent_context.send_message(message, recipient_id)`, it's actually telling the `AgentRuntime`, "Please deliver this `message` directly to the agent identified by `recipient_id`." The runtime finds the recipient agent (creating it if necessary) and calls its `on_message` method. It's like putting a specific name on an envelope and handing it to the mail room.
    * **Publish (`publish_message`):** When an agent calls `await agent_context.publish_message(message, topic_id)`, it tells the runtime, "Post this `message` to the announcement board named `topic_id`." The runtime then checks its list of **subscriptions** (who signed up for which boards). For every matching subscription, it figures out the correct recipient agent(s) (based on the subscription rule) and delivers the message to their `on_message` method.

3. **Lifecycle Management (Opening/Closing the Office):**
    * The runtime needs to be started to begin processing messages. Typically, you call `runtime.start()`. This usually kicks off a background process or loop that watches for incoming messages.
    * When work is done, you need to stop the runtime gracefully.
`runtime.stop_when_idle()` is common – it waits until all messages currently in the queue have been processed, then stops. `runtime.stop()` stops more abruptly. 4. **State Management (Office Records):** * The runtime can save the state of *all* the agents it manages (`runtime.save_state()`) and load it back later (`runtime.load_state()`). This is useful for pausing and resuming complex multi-agent interactions. It can also save/load state for individual agents (`runtime.agent_save_state()` / `runtime.agent_load_state()`). We'll touch more on state in [Chapter 7: Memory](07_memory.md). ## Use Case Example: Running Our Researcher and Writer Let's finally run the Researcher/Writer scenario from Chapters 1 and 2. We need the `AgentRuntime` to make it happen. **Goal:** 1. Create a runtime. 2. Register factories for a 'researcher' and a 'writer' agent. 3. Tell the runtime that 'writer' agents are interested in "research.facts.available" topics (add subscription). 4. Start the runtime. 5. Send an initial `ResearchTopic` message to a 'researcher' agent. 6. Let the system run (Researcher publishes facts, Runtime delivers to Writer via subscription, Writer processes). 7. Stop the runtime when idle. **Code Snippets (Simplified):** ```python # 0. Imports and Message Definitions (from previous chapters) import asyncio from dataclasses import dataclass from autogen_core import ( AgentId, BaseAgent, SingleThreadedAgentRuntime, TopicId, MessageContext, TypeSubscription, AgentInstantiationContext ) @dataclass class ResearchTopic: topic: str @dataclass class ResearchFacts: topic: str; facts: list[str] ``` These are the messages our agents will exchange. ```python # 1. 
Define Agent Logic (using BaseAgent) class ResearcherAgent(BaseAgent): async def on_message_impl(self, message: ResearchTopic, ctx: MessageContext): print(f"Researcher ({self.id}) got topic: {message.topic}") facts = [f"Fact 1 about {message.topic}", f"Fact 2"] results_topic = TopicId("research.facts.available", message.topic) # Use the runtime (via self.publish_message helper) to publish await self.publish_message( ResearchFacts(topic=message.topic, facts=facts), results_topic ) print(f"Researcher ({self.id}) published facts to {results_topic}") class WriterAgent(BaseAgent): async def on_message_impl(self, message: ResearchFacts, ctx: MessageContext): print(f"Writer ({self.id}) received facts via topic '{ctx.topic_id}': {message.facts}") draft = f"Draft for {message.topic}: {'; '.join(message.facts)}" print(f"Writer ({self.id}) created draft: '{draft}'") # This agent doesn't send further messages in this example ``` Here we define the behavior of our two agent types, inheriting from `BaseAgent` which gives us `self.id`, `self.publish_message`, etc. ```python # 2. Define Agent Factories def researcher_factory(): # Gets runtime/id via AgentInstantiationContext inside BaseAgent.__init__ print("Runtime is creating a ResearcherAgent...") return ResearcherAgent(description="I research topics.") def writer_factory(): print("Runtime is creating a WriterAgent...") return WriterAgent(description="I write drafts from facts.") ``` These simple functions tell the runtime *how* to create instances of our agents when needed. ```python # 3. 
Setup and Run the Runtime async def main(): # Create the runtime (the office manager) runtime = SingleThreadedAgentRuntime() # Register the factories (tell the manager how to hire) await runtime.register_factory("researcher", researcher_factory) await runtime.register_factory("writer", writer_factory) print("Registered agent factories.") # Add the subscription (tell manager who listens to which announcements) # Rule: Messages to topics of type "research.facts.available" # should go to a "writer" agent whose key matches the topic source. writer_sub = TypeSubscription(topic_type="research.facts.available", agent_type="writer") await runtime.add_subscription(writer_sub) print(f"Added subscription: {writer_sub.id}") # Start the runtime (open the office) runtime.start() print("Runtime started.") # Send the initial message to kick things off research_task_topic = "AutoGen Agents" researcher_instance_id = AgentId(type="researcher", key=research_task_topic) print(f"Sending initial topic '{research_task_topic}' to {researcher_instance_id}") await runtime.send_message( message=ResearchTopic(topic=research_task_topic), recipient=researcher_instance_id, ) # Wait until all messages are processed (wait for work day to end) print("Waiting for runtime to become idle...") await runtime.stop_when_idle() print("Runtime stopped.") # Run the main function asyncio.run(main()) ``` This script sets up the `SingleThreadedAgentRuntime`, registers the blueprints (factories) and communication rules (subscription), starts the process, and then shuts down cleanly. **Expected Output (Conceptual Order):** ``` Registered agent factories. Added subscription: type=research.facts.available=>agent=writer Runtime started. Sending initial topic 'AutoGen Agents' to researcher/AutoGen Agents Waiting for runtime to become idle... Runtime is creating a ResearcherAgent... 
# First time researcher/AutoGen Agents is needed Researcher (researcher/AutoGen Agents) got topic: AutoGen Agents Researcher (researcher/AutoGen Agents) published facts to research.facts.available/AutoGen Agents Runtime is creating a WriterAgent... # First time writer/AutoGen Agents is needed (due to subscription) Writer (writer/AutoGen Agents) received facts via topic 'research.facts.available/AutoGen Agents': ['Fact 1 about AutoGen Agents', 'Fact 2'] Writer (writer/AutoGen Agents) created draft: 'Draft for AutoGen Agents: Fact 1 about AutoGen Agents; Fact 2' Runtime stopped. ``` You can see the runtime orchestrating the creation of agents and the flow of messages based on the initial request and the subscription rule. ## Under the Hood: How the Manager Works Let's peek inside the `SingleThreadedAgentRuntime` (a common implementation provided by AutoGen Core) to understand the flow. **Core Idea:** It uses an internal queue (`_message_queue`) to hold incoming requests (`send_message`, `publish_message`). A background task continuously takes items from the queue and processes them one by one (though the *handling* of a message might involve `await` and allow other tasks to run). **1. 
Agent Creation (`_get_agent`, `_invoke_agent_factory`)**

When the runtime needs an agent instance (e.g., to deliver a message) that hasn't been created yet:

```mermaid
sequenceDiagram
    participant Runtime as AgentRuntime
    participant Factory as Agent Factory Func
    participant AgentCtx as AgentInstantiationContext
    participant Agent as New Agent Instance

    Runtime->>Runtime: Check if agent instance exists (e.g., in `_instantiated_agents` dict)
    alt Agent Not Found
        Runtime->>Runtime: Find registered factory for agent type
        Runtime->>AgentCtx: Set current runtime & agent_id
        activate AgentCtx
        Runtime->>Factory: Call factory function()
        activate Factory
        Factory->>AgentCtx: (Inside Agent.__init__) Get current runtime
        AgentCtx-->>Factory: Return runtime
        Factory->>AgentCtx: (Inside Agent.__init__) Get current agent_id
        AgentCtx-->>Factory: Return agent_id
        Factory-->>Runtime: Return new Agent instance
        deactivate Factory
        Runtime->>AgentCtx: Clear context
        deactivate AgentCtx
        Runtime->>Runtime: Store new agent instance
    end
    Runtime->>Runtime: Return agent instance
```

* The runtime looks up the factory function registered for the required `AgentId.type`.
* It uses `AgentInstantiationContext.populate_context` to temporarily store its own reference and the target `AgentId`.
* It calls the factory function.
* Inside the agent's `__init__` (usually via `BaseAgent`), `AgentInstantiationContext.current_runtime()` and `AgentInstantiationContext.current_agent_id()` are called to retrieve the context set by the runtime.
* The factory returns the fully initialized agent instance.
* The runtime stores this instance for future use.
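
The lookup-or-create steps listed above can be sketched with nothing but the standard library. This is a toy illustration, not AutoGen Core code: `ToyRuntime`, `ToyAgent`, `get_agent`, and `_creation_ctx` are hypothetical names chosen for this example, and the real runtime is asynchronous and does considerably more.

```python
from contextvars import ContextVar
from typing import Callable, Dict

# Hypothetical stand-in for AgentInstantiationContext: holds (runtime, agent_id)
# only while a factory is being called.
_creation_ctx = ContextVar("creation_ctx")


class ToyAgent:
    def __init__(self, description: str) -> None:
        # Reads the values the runtime stored just before calling the factory.
        self.runtime, self.agent_id = _creation_ctx.get()
        self.description = description


class ToyRuntime:
    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], ToyAgent]] = {}
        self._instances: Dict[str, ToyAgent] = {}

    def register_factory(self, agent_type: str, factory: Callable[[], ToyAgent]) -> None:
        self._factories[agent_type] = factory

    def get_agent(self, agent_type: str, key: str) -> ToyAgent:
        agent_id = f"{agent_type}/{key}"
        if agent_id not in self._instances:              # check for an existing instance
            factory = self._factories[agent_type]        # find the registered factory
            token = _creation_ctx.set((self, agent_id))  # populate the context
            try:
                self._instances[agent_id] = factory()    # call the factory and store the result
            finally:
                _creation_ctx.reset(token)               # clear the context again
        return self._instances[agent_id]


runtime = ToyRuntime()
runtime.register_factory("researcher", lambda: ToyAgent("I research topics."))
first = runtime.get_agent("researcher", "AutoGen Agents")
second = runtime.get_agent("researcher", "AutoGen Agents")
print(first.agent_id, first is second)
```

The second `get_agent` call returns the cached instance, and constructing `ToyAgent` outside of `get_agent` raises `LookupError` because no context has been set, which mirrors why agents are normally created through the runtime rather than directly.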
```python
# From: _agent_instantiation.py (Simplified)
from contextlib import contextmanager
from contextvars import ContextVar

class AgentInstantiationContext:
    _CONTEXT_VAR = ContextVar("agent_context") # Stores (runtime, agent_id)

    @classmethod
    @contextmanager
    def populate_context(cls, ctx: tuple[AgentRuntime, AgentId]):
        token = cls._CONTEXT_VAR.set(ctx) # Store context for this block
        try:
            yield # Code inside the 'with' block runs here
        finally:
            cls._CONTEXT_VAR.reset(token) # Clean up context

    @classmethod
    def current_runtime(cls) -> AgentRuntime:
        return cls._CONTEXT_VAR.get()[0] # Retrieve runtime from context

    @classmethod
    def current_agent_id(cls) -> AgentId:
        return cls._CONTEXT_VAR.get()[1] # Retrieve agent_id from context
```

This context manager pattern ensures the correct runtime and ID are available *only* during the agent's creation by the runtime.

**2. Direct Messaging (`send_message` -> `_process_send`)**

```mermaid
sequenceDiagram
    participant Sender as Sending Agent/Code
    participant Runtime as AgentRuntime
    participant Queue as Internal Queue
    participant Recipient as Recipient Agent

    Sender->>+Runtime: send_message(msg, recipient_id, ...)
    Runtime->>Runtime: Create Future (for response)
    Runtime->>+Queue: Put SendMessageEnvelope(msg, recipient_id, future)
    Runtime-->>-Sender: Return awaitable Future
    Note over Queue, Runtime: Background task picks up envelope
    Runtime->>Runtime: _process_send(envelope)
    Runtime->>+Recipient: _get_agent(recipient_id) (creates if needed)
    Recipient-->>-Runtime: Return Agent instance
    Runtime->>+Recipient: on_message(msg, context)
    Recipient->>Recipient: Process message...
    Recipient-->>-Runtime: Return response value
    Runtime->>Runtime: Set Future result with response value
```

* `send_message` creates a `Future` object (a placeholder for the eventual result) and wraps the message details in a `SendMessageEnvelope`.
* This envelope is put onto the internal `_message_queue`.
* The background task picks up the envelope.
* `_process_send` gets the recipient agent instance (using `_get_agent`).
* It calls the recipient's `on_message` method.
* When `on_message` returns a result, `_process_send` sets the result on the `Future` object, which makes the original `await runtime.send_message(...)` call return the value.

**3. Publish/Subscribe (`publish_message` -> `_process_publish`)**

```mermaid
sequenceDiagram
    participant Publisher as Publishing Agent/Code
    participant Runtime as AgentRuntime
    participant Queue as Internal Queue
    participant SubManager as SubscriptionManager
    participant Subscriber as Subscribed Agent

    Publisher->>+Runtime: publish_message(msg, topic_id, ...)
    Runtime->>+Queue: Put PublishMessageEnvelope(msg, topic_id)
    Runtime-->>-Publisher: Return (None for publish)
    Note over Queue, Runtime: Background task picks up envelope
    Runtime->>Runtime: _process_publish(envelope)
    Runtime->>+SubManager: get_subscribed_recipients(topic_id)
    SubManager->>SubManager: Find matching subscriptions
    SubManager->>SubManager: Map subscriptions to AgentIds
    SubManager-->>-Runtime: Return list of recipient AgentIds
    loop For each recipient AgentId
        Runtime->>+Subscriber: _get_agent(recipient_id) (creates if needed)
        Subscriber-->>-Runtime: Return Agent instance
        Runtime->>+Subscriber: on_message(msg, context with topic_id)
        Subscriber->>Subscriber: Process message...
        Subscriber-->>-Runtime: Return (usually None for publish)
    end
```

* `publish_message` wraps the message in a `PublishMessageEnvelope` and puts it on the queue.
* The background task picks it up.
* `_process_publish` asks the `SubscriptionManager` (`_subscription_manager`) for all `AgentId`s that are subscribed to the given `topic_id`.
* The `SubscriptionManager` checks its registered `Subscription` objects (`_subscriptions` list, added via `add_subscription`). For each `Subscription` where `is_match(topic_id)` is true, it calls `map_to_agent(topic_id)` to get the target `AgentId`.
* For each resulting `AgentId`, the runtime gets the agent instance and calls its `on_message` method, providing the `topic_id` in the `MessageContext`.

```python
# From: _runtime_impl_helpers.py (SubscriptionManager simplified)
from typing import List

class SubscriptionManager:
    def __init__(self):
        self._subscriptions: List[Subscription] = []
        # Optimization cache can be added here

    async def add_subscription(self, subscription: Subscription):
        self._subscriptions.append(subscription)
        # Clear cache if any

    async def get_subscribed_recipients(self, topic: TopicId) -> List[AgentId]:
        recipients = []
        for sub in self._subscriptions:
            if sub.is_match(topic):
                recipients.append(sub.map_to_agent(topic))
        return recipients
```

The `SubscriptionManager` simply iterates through registered subscriptions to find matches when a message is published.

## Next Steps

You now understand the `AgentRuntime` - the essential coordinator that brings Agents to life, manages their communication, and runs the entire show. It handles agent creation via factories, routes direct and published messages, and manages the system's lifecycle.

With the core concepts of `Agent`, `Messaging`, and `AgentRuntime` covered, we can start looking at more specialized building blocks. Next, we'll explore how agents can use external capabilities:

* [Chapter 4: Tool](04_tool.md): How to give agents tools (like functions or APIs) to perform specific actions beyond just processing messages.
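
As a closing recap of the publish path traced in this chapter, here is a minimal sketch of "enqueue now, route later" with type-based subscriptions. It is a standalone toy under stated assumptions: `ToyTopic`, `ToyTypeSubscription`, and `ToyPublishLoop` are hypothetical names, agent identities are plain `(type, key)` tuples, and delivery is recorded in a list instead of calling a real `on_message`. The point of the design is that `publish` returns as soon as the message is queued, while matching and mapping happen when the queue is drained.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyTopic:
    type: str
    source: str


@dataclass(frozen=True)
class ToyTypeSubscription:
    topic_type: str
    agent_type: str

    def is_match(self, topic: ToyTopic) -> bool:
        # Match on the exact topic type only.
        return topic.type == self.topic_type

    def map_to_agent(self, topic: ToyTopic) -> Tuple[str, str]:
        # (agent type, agent key): the key is taken from the topic source.
        return (self.agent_type, topic.source)


class ToyPublishLoop:
    def __init__(self) -> None:
        self.subscriptions: List[ToyTypeSubscription] = []
        self.delivered: List[Tuple[Tuple[str, str], str]] = []
        self._queue: asyncio.Queue = asyncio.Queue()

    async def publish(self, message: str, topic: ToyTopic) -> None:
        # Publishing only enqueues; no subscriber runs yet.
        await self._queue.put((message, topic))

    async def drain(self) -> None:
        # Stand-in for the background task: route each queued message.
        while not self._queue.empty():
            message, topic = await self._queue.get()
            for sub in self.subscriptions:
                if sub.is_match(topic):
                    self.delivered.append((sub.map_to_agent(topic), message))


async def demo() -> List[Tuple[Tuple[str, str], str]]:
    loop = ToyPublishLoop()
    loop.subscriptions.append(ToyTypeSubscription("research.facts.available", "writer"))
    loop.subscriptions.append(ToyTypeSubscription("research.facts.available", "fact_checker"))
    loop.subscriptions.append(ToyTypeSubscription("other.topic", "auditor"))
    await loop.publish("Fact A", ToyTopic("research.facts.available", "blog-post-autogen"))
    await loop.drain()
    return loop.delivered


delivered = asyncio.run(demo())
print(delivered)
```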
---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/AutoGen Core/04_tool.md
================================================
---
layout: default
title: "Tool"
parent: "AutoGen Core"
nav_order: 4
---

# Chapter 4: Tool - Giving Agents Specific Capabilities

In the previous chapters, we learned about Agents as workers ([Chapter 1](01_agent.md)), how they can communicate directly or using announcements ([Chapter 2](02_messaging_system__topic___subscription_.md)), and the `AgentRuntime` that manages them ([Chapter 3](03_agentruntime.md)).

Agents can process messages and coordinate, but what if an agent needs to perform a very specific action, like looking up information online, running a piece of code, accessing a database, or even just finding out the current date? They need specialized *capabilities*.

This is where the concept of a **Tool** comes in.

## Motivation: Agents Need Skills!

Imagine our `Writer` agent from before. It receives facts and writes a draft. Now, let's say we want the `Writer` (or perhaps a smarter `Assistant` agent helping it) to always include the current date in the blog post title.

How does the agent get the current date? It doesn't inherently know it. It needs a specific *skill* or *tool* for that.

A `Tool` in AutoGen Core represents exactly this: a specific, well-defined capability that an Agent can use. Think of it like giving an employee (Agent) a specialized piece of equipment (Tool), like a calculator, a web browser, or a calendar lookup program.

## Key Concepts: Understanding Tools

Let's break down what defines a Tool:

1. **It's a Specific Capability:** A Tool performs one well-defined task. Examples:
    * `search_web(query: str)`
    * `run_python_code(code: str)`
    * `get_stock_price(ticker: str)`
    * `get_current_date()`

2. **It Has a Schema (The Manual):** This is crucial!
For an Agent (especially one powered by a Large Language Model - LLM) to know *when* and *how* to use a tool, the tool needs a clear description or "manual". This is called the `ToolSchema`. It typically includes:
    * **`name`**: A unique identifier for the tool (e.g., `get_current_date`).
    * **`description`**: A clear explanation of what the tool does, which helps the LLM decide if this tool is appropriate for the current task (e.g., "Fetches the current date in YYYY-MM-DD format").
    * **`parameters`**: Defines what inputs the tool needs. This is itself a schema (`ParametersSchema`) describing the input fields, their types, and which ones are required. For our `get_current_date` example, it might need no parameters. For `get_stock_price`, it would need a `ticker` parameter of type string.

```python
# From: tools/_base.py (Simplified Concept)
from typing import TypedDict, Dict, Any, Sequence, NotRequired

class ParametersSchema(TypedDict):
    type: str # Usually "object"
    properties: Dict[str, Any] # Defines input fields and their types
    required: NotRequired[Sequence[str]] # List of required field names

class ToolSchema(TypedDict):
    name: str
    description: NotRequired[str]
    parameters: NotRequired[ParametersSchema]
    # 'strict' flag also possible (Chapter 5 related)
```

This schema allows an LLM to understand: "Ah, there's a tool called `get_current_date` that takes no inputs and gives me the current date. I should use that now!"

3. **It Can Be Executed:** Once an agent decides to use a tool (often based on the schema), there needs to be a mechanism to actually *run* the tool's underlying function and get the result.

## Use Case Example: Adding a `get_current_date` Tool

Let's equip an agent with the ability to find the current date.

**Goal:** Define a tool that gets the current date and show how it could be executed by a specialized agent.

**Step 1: Define the Python Function**

First, we need the actual Python code that performs the action.
```python
# File: get_date_function.py
import datetime

def get_current_date() -> str:
    """Fetches the current date as a string."""
    today = datetime.date.today()
    return today.isoformat() # Returns date like "2023-10-27"

# Test the function
print(f"Function output: {get_current_date()}")
```

This is a standard Python function. It takes no arguments and returns the date as a string.

**Step 2: Wrap it as a `FunctionTool`**

AutoGen Core provides a convenient way to turn a Python function like this into a `Tool` object using `FunctionTool`. It automatically inspects the function's signature (arguments and return type) and docstring to help build the `ToolSchema`.

```python
# File: create_date_tool.py
from autogen_core.tools import FunctionTool
from get_date_function import get_current_date # Import our function

# Create the Tool instance
# We provide the function and a clear description for the LLM
date_tool = FunctionTool(
    func=get_current_date,
    description="Use this tool to get the current date in YYYY-MM-DD format."
    # Name defaults to function name 'get_current_date'
)

# Let's see what FunctionTool generated
print(f"Tool Name: {date_tool.name}")
print(f"Tool Description: {date_tool.description}")

# The schema defines inputs (none in this case)
# print(f"Tool Schema Parameters: {date_tool.schema['parameters']}")
# Output (simplified): {'type': 'object', 'properties': {}, 'required': []}
```

`FunctionTool` wraps our `get_current_date` function. It uses the function name as the tool name and the description we provided. It also correctly determines from the function signature that there are no input parameters (`properties: {}`).

**Step 3: How an Agent Might Request Tool Use**

Now we have a `date_tool`. How is it used? Typically, an LLM-powered agent (which we'll see more of in [Chapter 5: ChatCompletionClient](05_chatcompletionclient.md)) analyzes a request and decides a tool is needed.
It then generates a request to *call* that tool, often using a specific message type like `FunctionCall`. ```python # File: tool_call_request.py from autogen_core import FunctionCall # Represents a request to call a tool # Imagine an LLM agent decided to use the date tool. # It constructs this message, providing the tool name and arguments (as JSON string). date_call_request = FunctionCall( id="call_date_001", # A unique ID for this specific call attempt name="get_current_date", # Matches the Tool's name arguments="{}" # An empty JSON object because no arguments are needed ) print("FunctionCall message:", date_call_request) # Output: FunctionCall(id='call_date_001', name='get_current_date', arguments='{}') ``` This `FunctionCall` message is like a work order: "Please execute the tool named `get_current_date` with these arguments." **Step 4: The `ToolAgent` Executes the Tool** Who receives this `FunctionCall` message? Usually, a specialized agent called `ToolAgent`. You create a `ToolAgent` and give it the list of tools it knows how to execute. When it receives a `FunctionCall`, it finds the matching tool and runs it. ```python # File: tool_agent_example.py import asyncio from autogen_core.tool_agent import ToolAgent from autogen_core.models import FunctionExecutionResult from create_date_tool import date_tool # Import the tool we created from tool_call_request import date_call_request # Import the request message # Create an agent specifically designed to execute tools tool_executor = ToolAgent( description="I can execute tools like getting the date.", tools=[date_tool] # Give it the list of tools it manages ) # --- Simulation of Runtime delivering the message --- # In a real app, the AgentRuntime (Chapter 3) would route the # date_call_request message to this tool_executor agent. 
# We simulate the call to its message handler here: async def simulate_execution(): # Fake context (normally provided by runtime) class MockContext: cancellation_token = None ctx = MockContext() print(f"ToolAgent received request: {date_call_request.name}") result: FunctionExecutionResult = await tool_executor.handle_function_call( message=date_call_request, ctx=ctx ) print(f"ToolAgent produced result: {result}") asyncio.run(simulate_execution()) ``` **Expected Output:** ``` ToolAgent received request: get_current_date ToolAgent produced result: FunctionExecutionResult(content='2023-10-27', call_id='call_date_001', is_error=False, name='get_current_date') # Date will be current date ``` The `ToolAgent` received the `FunctionCall`, found the `date_tool` in its list, executed the underlying `get_current_date` function, and packaged the result (the date string) into a `FunctionExecutionResult` message. This result message can then be sent back to the agent that originally requested the tool use. ## Under the Hood: How Tool Execution Works Let's visualize the typical flow when an LLM agent decides to use a tool managed by a `ToolAgent`. **Conceptual Flow:** ```mermaid sequenceDiagram participant LLMA as LLM Agent (Decides) participant Caller as Caller Agent (Orchestrates) participant ToolA as ToolAgent (Executes) participant ToolFunc as Tool Function (e.g., get_current_date) Note over LLMA: Analyzes conversation, decides tool needed. LLMA->>Caller: Sends AssistantMessage containing FunctionCall(name='get_current_date', args='{}') Note over Caller: Receives LLM response, sees FunctionCall. Caller->>+ToolA: Uses runtime.send_message(message=FunctionCall, recipient=ToolAgent_ID) Note over ToolA: Receives FunctionCall via on_message. ToolA->>ToolA: Looks up 'get_current_date' in its internal list of Tools. 
ToolA->>+ToolFunc: Calls tool.run_json(args={}) -> triggers get_current_date() ToolFunc-->>-ToolA: Returns the result (e.g., "2023-10-27") ToolA->>ToolA: Creates FunctionExecutionResult message with the content. ToolA-->>-Caller: Returns FunctionExecutionResult via runtime messaging. Note over Caller: Receives the tool result. Caller->>LLMA: Sends FunctionExecutionResultMessage to LLM for next step. Note over LLMA: Now knows the current date. ``` 1. **Decision:** An LLM-powered agent decides a tool is needed based on the conversation and the available tools' descriptions. It generates a `FunctionCall`. 2. **Request:** A "Caller" agent (often the same LLM agent or a managing agent) sends this `FunctionCall` message to the dedicated `ToolAgent` using the `AgentRuntime`. 3. **Lookup:** The `ToolAgent` receives the message, extracts the tool `name` (`get_current_date`), and finds the corresponding `Tool` object (our `date_tool`) in the list it was configured with. 4. **Execution:** The `ToolAgent` calls the `run_json` method on the `Tool` object, passing the arguments from the `FunctionCall`. For a `FunctionTool`, `run_json` validates the arguments against the generated schema and then executes the original Python function (`get_current_date`). 5. **Result:** The Python function returns its result (the date string). 6. **Response:** The `ToolAgent` wraps this result string in a `FunctionExecutionResult` message, including the original `call_id`, and sends it back to the Caller agent. 7. **Continuation:** The Caller agent typically sends this result back to the LLM agent, allowing the conversation or task to continue with the new information. **Code Glimpse:** * **`Tool` Protocol (`tools/_base.py`):** Defines the basic contract any tool must fulfill. Key methods are `schema` (property returning the `ToolSchema`) and `run_json` (method to execute the tool with JSON-like arguments). 
* **`BaseTool` (`tools/_base.py`):** An abstract class that helps implement the `Tool` protocol, especially using Pydantic models for defining arguments (`args_type`) and return values (`return_type`). It automatically generates the `parameters` part of the schema from the `args_type` model. * **`FunctionTool` (`tools/_function_tool.py`):** Inherits from `BaseTool`. Its magic lies in automatically creating the `args_type` Pydantic model by inspecting the wrapped Python function's signature (`args_base_model_from_signature`). Its `run` method handles calling the original sync or async Python function. ```python # Inside FunctionTool (Simplified Concept) class FunctionTool(BaseTool[BaseModel, BaseModel]): def __init__(self, func, description, ...): self._func = func self._signature = get_typed_signature(func) # Automatically create Pydantic model for arguments args_model = args_base_model_from_signature(...) # Get return type from signature return_type = self._signature.return_annotation super().__init__(args_model, return_type, ...) async def run(self, args: BaseModel, ...): # Extract arguments from the 'args' model kwargs = args.model_dump() # Call the original Python function (sync or async) result = await self._call_underlying_func(**kwargs) return result # Must match the expected return_type ``` * **`ToolAgent` (`tool_agent/_tool_agent.py`):** A specialized `RoutedAgent`. It registers a handler specifically for `FunctionCall` messages. ```python # Inside ToolAgent (Simplified Concept) class ToolAgent(RoutedAgent): def __init__(self, ..., tools: List[Tool]): super().__init__(...) self._tools = {tool.name: tool for tool in tools} # Store tools by name @message_handler # Registers this for FunctionCall messages async def handle_function_call(self, message: FunctionCall, ctx: MessageContext): # Find the tool by name tool = self._tools.get(message.name) if tool is None: # Handle error: Tool not found raise ToolNotFoundException(...) 
try: # Parse arguments string into a dictionary arguments = json.loads(message.arguments) # Execute the tool's run_json method result_obj = await tool.run_json(args=arguments, ...) # Convert result object back to string if needed result_str = tool.return_value_as_string(result_obj) # Create the success result message return FunctionExecutionResult(content=result_str, ...) except Exception as e: # Handle execution errors return FunctionExecutionResult(content=f"Error: {e}", is_error=True, ...) ``` Its core logic is: find tool -> parse args -> run tool -> return result/error. ## Next Steps You've learned how **Tools** provide specific capabilities to Agents, defined by a **Schema** that LLMs can understand. We saw how `FunctionTool` makes it easy to wrap existing Python functions and how `ToolAgent` acts as the executor for these tools. This ability for agents to use tools is fundamental to building powerful and versatile AI systems that can interact with the real world or perform complex calculations. Now that agents can use tools, we need to understand more about the agents that *decide* which tools to use, which often involves interacting with Large Language Models: * [Chapter 5: ChatCompletionClient](05_chatcompletionclient.md): How agents interact with LLMs like GPT to generate responses or decide on actions (like calling a tool). * [Chapter 6: ChatCompletionContext](06_chatcompletioncontext.md): How the history of the conversation, including tool calls and results, is managed when talking to an LLM. 
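
As a recap of the `ToolAgent` flow described in this chapter (find tool -> parse args -> run tool -> return result/error), here is a minimal, dependency-free sketch. It does not use AutoGen Core: `SimpleCall`, `SimpleResult`, `get_current_date_stub`, and `execute_call` are hypothetical stand-ins invented for illustration. Unlike the simplified `ToolAgent` shown earlier, which raises for an unknown tool, this sketch reports that case as an error result.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class SimpleCall:
    """Hypothetical stand-in for a FunctionCall message."""
    id: str
    name: str
    arguments: str  # JSON-encoded arguments


@dataclass
class SimpleResult:
    """Hypothetical stand-in for a FunctionExecutionResult message."""
    call_id: str
    name: str
    content: str
    is_error: bool = False


def get_current_date_stub() -> str:
    # Fixed value so the example is deterministic.
    return "2023-10-27"


def execute_call(call: SimpleCall, tools: Dict[str, Callable[..., Any]]) -> SimpleResult:
    """Find the tool, parse the arguments, run it, and wrap the outcome."""
    tool = tools.get(call.name)
    if tool is None:
        return SimpleResult(call.id, call.name, f"Error: unknown tool '{call.name}'", is_error=True)
    try:
        kwargs = json.loads(call.arguments)  # "{}" becomes an empty dict
        return SimpleResult(call.id, call.name, str(tool(**kwargs)))
    except Exception as exc:  # bad JSON, wrong arguments, or a failing tool
        return SimpleResult(call.id, call.name, f"Error: {exc}", is_error=True)


tools = {"get_current_date": get_current_date_stub}
ok = execute_call(SimpleCall("call_date_001", "get_current_date", "{}"), tools)
missing = execute_call(SimpleCall("call_002", "get_weather", "{}"), tools)
print(ok)
print(missing.is_error)
```

The result always carries the original call ID, so the caller can match each result to the request that produced it.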
--- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/AutoGen Core/05_chatcompletionclient.md ================================================ --- layout: default title: "ChatCompletionClient" parent: "AutoGen Core" nav_order: 5 --- # Chapter 5: ChatCompletionClient - Talking to the Brains So far, we've learned about: * [Agents](01_agent.md): The workers in our system. * [Messaging](02_messaging_system__topic___subscription_.md): How agents communicate broadly. * [AgentRuntime](03_agentruntime.md): The manager that runs the show. * [Tools](04_tool.md): How agents get specific skills. But how does an agent actually *think* or *generate text*? Many powerful agents rely on Large Language Models (LLMs) – think of models like GPT-4, Claude, or Gemini – as their "brains". How does an agent in AutoGen Core communicate with these external LLM services? This is where the **`ChatCompletionClient`** comes in. It's the dedicated component for talking to LLMs. ## Motivation: Bridging the Gap to LLMs Imagine you want to build an agent that can summarize long articles. 1. You give the agent an article (as a message). 2. The agent needs to send this article to an LLM (like GPT-4). 3. It also needs to tell the LLM: "Please summarize this." 4. The LLM processes the request and generates a summary. 5. The agent needs to receive this summary back from the LLM. How does the agent handle the technical details of connecting to the LLM's specific API, formatting the request correctly, sending it over the internet, and understanding the response? The `ChatCompletionClient` solves this! Think of it as the **standard phone line and translator** connecting your agent to the LLM service. You tell the client *what* to say (the conversation history and instructions), and it handles *how* to say it to the specific LLM and translates the LLM's reply back into a standard format. 
## Key Concepts: Understanding the LLM Communicator Let's break down the `ChatCompletionClient`: 1. **LLM Communication Bridge:** It's the primary way AutoGen agents interact with external LLM APIs (like OpenAI, Anthropic, Google Gemini, etc.). It hides the complexity of specific API calls. 2. **Standard Interface (`create` method):** It defines a common way to send requests and receive responses, regardless of the underlying LLM. The core method is `create`. You give it: * `messages`: A list of messages representing the conversation history so far. * Optional `tools`: A list of tools ([Chapter 4](04_tool.md)) the LLM might be able to use. * Other parameters (like `json_output` hints, `cancellation_token`). 3. **Messages (`LLMMessage`):** The conversation history is passed as a sequence of specific message types defined in `autogen_core.models`: * `SystemMessage`: Instructions for the LLM (e.g., "You are a helpful assistant."). * `UserMessage`: Input from the user or another agent (e.g., the article text). * `AssistantMessage`: Previous responses from the LLM (can include text or requests to call functions/tools). * `FunctionExecutionResultMessage`: The results of executing a tool/function call. 4. **Tools (`ToolSchema`):** You can provide the schemas of available tools ([Chapter 4](04_tool.md)). The LLM might then respond not with text, but with a request to call one of these tools (`FunctionCall` inside an `AssistantMessage`). 5. **Response (`CreateResult`):** The `create` method returns a standard `CreateResult` object containing: * `content`: The LLM's generated text or a list of `FunctionCall` requests. * `finish_reason`: Why the LLM stopped generating (e.g., "stop", "length", "function_calls"). * `usage`: How many input (`prompt_tokens`) and output (`completion_tokens`) tokens were used. * `cached`: Whether the response came from a cache. 6. **Token Tracking:** The client automatically tracks token usage (`prompt_tokens`, `completion_tokens`) for each call. 
You can query the total usage via methods like `total_usage()`. This is vital for monitoring costs, as most LLM APIs charge based on tokens. ## Use Case Example: Summarizing Text with an LLM Let's build a simplified scenario where we use a `ChatCompletionClient` to ask an LLM to summarize text. **Goal:** Send text to an LLM via a client and get a summary back. **Step 1: Prepare the Input Messages** We need to structure our request as a list of `LLMMessage` objects. ```python # File: prepare_messages.py from autogen_core.models import SystemMessage, UserMessage # Instructions for the LLM system_prompt = SystemMessage( content="You are a helpful assistant designed to summarize text concisely." ) # The text we want to summarize article_text = """ AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and can seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools. """ user_request = UserMessage( content=f"Please summarize the following text in one sentence:\n\n{article_text}", source="User" # Indicate who provided this input ) # Combine into a list for the client messages_to_send = [system_prompt, user_request] print("Messages prepared:") for msg in messages_to_send: print(f"- {msg.type}: {msg.content[:50]}...") # Print first 50 chars ``` This code defines the instructions (`SystemMessage`) and the user's request (`UserMessage`) and puts them in a list, ready to be sent. **Step 2: Use the ChatCompletionClient (Conceptual)** Now, we need an instance of a `ChatCompletionClient`. In a real application, you'd configure a specific client (like `OpenAIChatCompletionClient` with your API key). For this example, let's imagine we have a pre-configured client called `llm_client`. 
```python # File: call_llm_client.py import asyncio from autogen_core.models import CreateResult, RequestUsage # Assume 'messages_to_send' is from the previous step # Assume 'llm_client' is a pre-configured ChatCompletionClient instance # (e.g., llm_client = OpenAIChatCompletionClient(config=...)) async def get_summary(client, messages): print("\nSending messages to LLM via ChatCompletionClient...") try: # The core call: send messages, get structured result response: CreateResult = await client.create( messages=messages, # We aren't providing tools in this simple example tools=[] ) print("Received response:") print(f"- Finish Reason: {response.finish_reason}") print(f"- Content: {response.content}") # This should be the summary print(f"- Usage (Tokens): Prompt={response.usage.prompt_tokens}, Completion={response.usage.completion_tokens}") print(f"- Cached: {response.cached}") # Also, check total usage tracked by the client total_usage = client.total_usage() print(f"\nClient Total Usage: Prompt={total_usage.prompt_tokens}, Completion={total_usage.completion_tokens}") except Exception as e: print(f"An error occurred: {e}") # --- Placeholder for actual client --- class MockChatCompletionClient: # Simulate a real client def __init__(self) -> None: # Track usage per instance; a class-level RequestUsage would be shared by every client self._total_usage = RequestUsage(prompt_tokens=0, completion_tokens=0) async def create(self, messages, tools=[], **kwargs) -> CreateResult: # Simulate API call and response prompt_len = sum(len(str(m.content)) for m in messages) // 4 # Rough token estimate summary = "AutoGen is a multi-agent framework for developing LLM applications."
completion_len = len(summary) // 4 # Rough token estimate usage = RequestUsage(prompt_tokens=prompt_len, completion_tokens=completion_len) self._total_usage.prompt_tokens += usage.prompt_tokens self._total_usage.completion_tokens += usage.completion_tokens return CreateResult( finish_reason="stop", content=summary, usage=usage, cached=False ) def total_usage(self) -> RequestUsage: return self._total_usage # Other required methods (count_tokens, model_info etc.) omitted for brevity async def main(): from prepare_messages import messages_to_send # Get messages from previous step mock_client = MockChatCompletionClient() await get_summary(mock_client, messages_to_send) # asyncio.run(main()) # If you run this, it uses the mock client ``` This code shows the essential `client.create(...)` call. We pass our `messages_to_send` and receive a `CreateResult`. We then print the summary (`response.content`) and the token usage reported for that specific call (`response.usage`) and the total tracked by the client (`client.total_usage()`). **How an Agent Uses It:** Typically, an agent's logic (e.g., inside its `on_message` handler) would: 1. Receive an incoming message (like the article to summarize). 2. Prepare the list of `LLMMessage` objects (including system prompts, history, and the new request). 3. Access a `ChatCompletionClient` instance (often provided during agent setup or accessed via its context). 4. Call `await client.create(...)`. 5. Process the `CreateResult` (e.g., extract the summary text, check for function calls if tools were provided). 6. Potentially send the result as a new message to another agent or return it. ## Under the Hood: How the Client Talks to the LLM What happens when you call `await client.create(...)`? 
**Conceptual Flow:** ```mermaid sequenceDiagram participant Agent as Agent Logic participant Client as ChatCompletionClient participant Formatter as API Formatter participant HTTP as HTTP Client participant LLM_API as External LLM API Agent->>+Client: create(messages, tools) Client->>+Formatter: Format messages & tools for specific API (e.g., OpenAI JSON format) Formatter-->>-Client: Return formatted request body Client->>+HTTP: Send POST request to LLM API endpoint with formatted body & API Key HTTP->>+LLM_API: Transmit request over network LLM_API->>LLM_API: Process request, generate completion/function call LLM_API-->>-HTTP: Return API response (e.g., JSON) HTTP-->>-Client: Receive HTTP response Client->>+Formatter: Parse API response (extract content, usage, finish_reason) Formatter-->>-Client: Return parsed data Client->>Client: Create standard CreateResult object Client-->>-Agent: Return CreateResult ``` 1. **Prepare:** The `ChatCompletionClient` takes the standard `LLMMessage` list and `ToolSchema` list. 2. **Format:** It translates these into the specific format required by the target LLM's API (e.g., the JSON structure expected by OpenAI's `/chat/completions` endpoint). This might involve renaming roles (like `SystemMessage` to `system`), formatting tool descriptions, etc. 3. **Request:** It uses an underlying HTTP client to send a network request (usually a POST request) to the LLM service's API endpoint, including the formatted data and authentication (like an API key). 4. **Wait & Receive:** It waits for the LLM service to process the request and send back a response over the network. 5. **Parse:** It receives the raw HTTP response (usually JSON) from the API. 6. **Standardize:** It parses this specific API response, extracting the generated text or function calls, token usage figures, finish reason, etc. 7. **Return:** It packages all this information into a standard `CreateResult` object and returns it to the calling agent code. 
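
The Format and Standardize steps above can be sketched without any network access. This is an illustrative approximation only: `Msg`, `ROLE_BY_KIND`, `format_request`, and `parse_response` are hypothetical names, the model name is a placeholder, and the request and response dictionaries only mimic the general shape of an OpenAI-style chat completions payload. A real client would also handle tools, streaming, and errors.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Msg:
    """Hypothetical stand-in for an LLMMessage (kind + text only)."""
    kind: str      # "SystemMessage", "UserMessage", or "AssistantMessage"
    content: str


# Map the framework's message kinds to the roles used by OpenAI-style chat APIs.
ROLE_BY_KIND = {
    "SystemMessage": "system",
    "UserMessage": "user",
    "AssistantMessage": "assistant",
}


def format_request(messages: List[Msg], model: str) -> Dict[str, Any]:
    """Step 2 (Format): build the JSON body for the API."""
    return {
        "model": model,
        "messages": [{"role": ROLE_BY_KIND[m.kind], "content": m.content} for m in messages],
    }


def parse_response(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Steps 6-7 (Standardize, Return): pull out the fields a CreateResult carries."""
    choice = raw["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": raw["usage"]["prompt_tokens"],
        "completion_tokens": raw["usage"]["completion_tokens"],
    }


body = format_request(
    [Msg("SystemMessage", "You summarize text."), Msg("UserMessage", "Summarize: ...")],
    model="example-model",  # placeholder model name
)
# A hand-written response in the general shape an OpenAI-style API returns.
fake_response = {
    "choices": [{"message": {"role": "assistant", "content": "A summary."}, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3},
}
result = parse_response(fake_response)
print(body["messages"][0]["role"], result["content"])
```

Keeping formatting and parsing in separate functions mirrors the diagram: only those two pieces need to change when the target API changes.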
**Code Glimpse:** * **`ChatCompletionClient` (`models/_model_client.py`):** This is the abstract base class defining the *contract* that all specific clients must follow. ```python # From: models/_model_client.py (Simplified ABC) from abc import ABC, abstractmethod from typing import Sequence, Optional, Mapping, Any, AsyncGenerator, Union from ._types import LLMMessage, CreateResult, RequestUsage from ..tools import Tool, ToolSchema from .. import CancellationToken class ChatCompletionClient(ABC): @abstractmethod async def create( self, messages: Sequence[LLMMessage], *, tools: Sequence[Tool | ToolSchema] = [], json_output: Optional[bool] = None, # Hint for JSON mode extra_create_args: Mapping[str, Any] = {}, # API-specific args cancellation_token: Optional[CancellationToken] = None, ) -> CreateResult: ... # The core method @abstractmethod def create_stream( self, # Similar to create, but yields results incrementally # ... parameters ... ) -> AsyncGenerator[Union[str, CreateResult], None]: ... @abstractmethod def total_usage(self) -> RequestUsage: ... # Get total tracked usage @abstractmethod def count_tokens(self, messages: Sequence[LLMMessage], *, tools: Sequence[Tool | ToolSchema] = []) -> int: ... # Estimate token count # Other methods like close(), actual_usage(), remaining_tokens(), model_info... ``` Concrete classes like `OpenAIChatCompletionClient` and `AnthropicChatCompletionClient` implement these methods using the specific libraries and API calls for each service. * **`LLMMessage` Types (`models/_types.py`):** These define the structure of messages passed *to* the client. ```python # From: models/_types.py (Simplified) from pydantic import BaseModel from typing import List, Union, Literal from .. import FunctionCall, Image # FunctionCall is from the Chapter 4 context class SystemMessage(BaseModel): content: str type: Literal["SystemMessage"] = "SystemMessage" class UserMessage(BaseModel): content: Union[str, List[Union[str, Image]]] # Can include images!
source: str type: Literal["UserMessage"] = "UserMessage" class AssistantMessage(BaseModel): content: Union[str, List[FunctionCall]] # Can be text or function calls source: str type: Literal["AssistantMessage"] = "AssistantMessage" # FunctionExecutionResultMessage also exists here... ``` * **`CreateResult` (`models/_types.py`):** This defines the structure of the response *from* the client. ```python # From: models/_types.py (Simplified) from pydantic import BaseModel from dataclasses import dataclass from typing import Union, List, Literal, Optional from .. import FunctionCall @dataclass class RequestUsage: prompt_tokens: int completion_tokens: int FinishReasons = Literal["stop", "length", "function_calls", "content_filter", "unknown"] class CreateResult(BaseModel): finish_reason: FinishReasons content: Union[str, List[FunctionCall]] # LLM output usage: RequestUsage # Token usage for this call cached: bool # Optional fields like logprobs, thought... ``` Using these standard types ensures that agent logic can work consistently, even if you switch the underlying LLM service by using a different `ChatCompletionClient` implementation. ## Next Steps You now understand the role of `ChatCompletionClient` as the crucial link between AutoGen agents and the powerful capabilities of Large Language Models. It provides a standard way to send conversational history and tool definitions, receive generated text or function call requests, and track token usage. Managing the conversation history (`messages`) sent to the client is very important. How do you ensure the LLM has the right context, especially after tool calls have happened? * [Chapter 6: ChatCompletionContext](06_chatcompletioncontext.md): Learn how AutoGen helps manage the conversation history, including adding tool call requests and their results, before sending it to the `ChatCompletionClient`.
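
Because `CreateResult.content` can be either plain text or a list of function call requests, agent code usually branches on its type. The sketch below illustrates that branching with hypothetical stand-in classes (`CallRequest` and `Result` are not AutoGen Core types, and `describe` is an invented helper), so it runs without any LLM or library dependency.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class CallRequest:
    """Hypothetical stand-in for FunctionCall."""
    id: str
    name: str
    arguments: str


@dataclass
class Result:
    """Hypothetical stand-in for CreateResult (only two of its fields)."""
    finish_reason: str
    content: Union[str, List[CallRequest]]


def describe(result: Result) -> str:
    """Return a short description of what the agent should do next."""
    if isinstance(result.content, str):
        # Plain text: the LLM answered directly.
        return f"reply with text: {result.content}"
    # Otherwise the LLM asked for one or more tools to be executed.
    names = ", ".join(call.name for call in result.content)
    return f"execute tools: {names}"


text_result = Result("stop", "AutoGen is a multi-agent framework.")
tool_result = Result("function_calls", [CallRequest("call_1", "get_current_date", "{}")])
print(describe(text_result))
print(describe(tool_result))
```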
--- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/AutoGen Core/06_chatcompletioncontext.md ================================================ --- layout: default title: "ChatCompletionContext" parent: "AutoGen Core" nav_order: 6 --- # Chapter 6: ChatCompletionContext - Remembering the Conversation In [Chapter 5: ChatCompletionClient](05_chatcompletionclient.md), we learned how agents talk to Large Language Models (LLMs) using a `ChatCompletionClient`. We saw that we need to send a list of `messages` (the conversation history) to the LLM so it knows the context. But conversations can get very long! Imagine talking on the phone for an hour. Can you remember *every single word* that was said? Probably not. You remember the main points, the beginning, and what was said most recently. LLMs have a similar limitation – they can only pay attention to a certain amount of text at once (called the "context window"). If we send the *entire* history of a very long chat, it might be too much for the LLM, lead to errors, be slow, or cost more money (since many LLMs charge based on the amount of text). So, how do we smartly choose *which* parts of the conversation history to send? This is the problem that **`ChatCompletionContext`** solves. ## Motivation: Keeping LLM Conversations Focused Let's say we have a helpful assistant agent chatting with a user: 1. **User:** "Hi! Can you tell me about AutoGen?" 2. **Assistant:** "Sure! AutoGen is a framework..." (provides details) 3. **User:** "Thanks! Now, can you draft an email to my team about our upcoming meeting?" 4. **Assistant:** "Okay, what's the meeting about?" 5. **User:** "It's about the project planning for Q3." 6. **Assistant:** (Needs to draft the email) When the Assistant needs to draft the email (step 6), does it need the *exact* text from step 2 about what AutoGen is? Probably not. 
It definitely needs the instructions from step 3 and the topic from step 5. Maybe the initial greeting isn't super important either. `ChatCompletionContext` acts like a **smart transcript editor**. Before sending the history to the LLM via the `ChatCompletionClient`, it reviews the full conversation log and prepares a shorter, focused version containing only the messages it thinks are most relevant for the LLM's next response. ## Key Concepts: Managing the Chat History 1. **The Full Transcript Holder:** A `ChatCompletionContext` object holds the *complete* list of messages (`LLMMessage` objects like `SystemMessage`, `UserMessage`, `AssistantMessage` from Chapter 5) that have occurred in a specific conversation thread. You add new messages using its `add_message` method. 2. **The Smart View Generator (`get_messages`):** The core job of `ChatCompletionContext` is done by its `get_messages` method. When called, it looks at the *full* transcript it holds, but returns only a *subset* of those messages based on its specific strategy. This subset is what you'll actually send to the `ChatCompletionClient`. 3. **Different Strategies for Remembering:** Because different situations require different focus, AutoGen Core provides several `ChatCompletionContext` implementations (strategies): * **`UnboundedChatCompletionContext`:** The simplest (and sometimes riskiest!). It doesn't edit anything; `get_messages` just returns the *entire* history. Good for short chats, but can break with long ones. * **`BufferedChatCompletionContext`:** Like remembering only the last few things someone said. It keeps the most recent `N` messages (where `N` is the `buffer_size` you set). Good for focusing on recent interactions. * **`HeadAndTailChatCompletionContext`:** Tries to get the best of both worlds. It keeps the first few messages (the "head", maybe containing initial instructions) and the last few messages (the "tail", the recent context). It skips the messages in the middle. 
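
The three strategies above boil down to different ways of slicing one list. The following sketch shows the idea on plain strings instead of `LLMMessage` objects. The function names `unbounded`, `buffered`, and `head_and_tail` are hypothetical helpers, not AutoGen Core APIs, and the "Skipped N messages." placeholder is modeled on the head-and-tail behavior this chapter describes.

```python
from typing import List


def unbounded(history: List[str]) -> List[str]:
    """Return every message, unchanged."""
    return list(history)


def buffered(history: List[str], buffer_size: int) -> List[str]:
    """Return only the most recent `buffer_size` messages."""
    return history[-buffer_size:] if buffer_size > 0 else []


def head_and_tail(history: List[str], head_size: int, tail_size: int) -> List[str]:
    """Return the first `head_size` and last `tail_size` messages, marking the gap."""
    head = history[:head_size]
    tail = history[-tail_size:] if tail_size > 0 else []
    skipped = len(history) - len(head) - len(tail)
    if skipped <= 0:
        # Head and tail already cover everything, so nothing is dropped.
        return list(history)
    return head + [f"Skipped {skipped} messages."] + tail


history = ["m1", "m2", "m3", "m4", "m5", "m6"]
print(unbounded(history))            # ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']
print(buffered(history, 3))          # ['m4', 'm5', 'm6']
print(head_and_tail(history, 1, 2))  # ['m1', 'Skipped 3 messages.', 'm5', 'm6']
```

In every case the full history stays intact; only the view handed to the LLM changes.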
## Use Case Example: Chatting with Different Memory Strategies Let's simulate adding messages to different context managers and see what `get_messages` returns. **Step 1: Define some messages** ```python # File: define_chat_messages.py from autogen_core.models import ( SystemMessage, UserMessage, AssistantMessage, LLMMessage ) from typing import List # The initial instruction for the assistant system_msg = SystemMessage(content="You are a helpful assistant.") # A sequence of user/assistant turns chat_sequence: List[LLMMessage] = [ UserMessage(content="What is AutoGen?", source="User"), AssistantMessage(content="AutoGen is a multi-agent framework...", source="Agent"), UserMessage(content="What can it do?", source="User"), AssistantMessage(content="It can build complex LLM apps.", source="Agent"), UserMessage(content="Thanks!", source="User") ] # Combine system message and the chat sequence full_history: List[LLMMessage] = [system_msg] + chat_sequence print(f"Total messages in full history: {len(full_history)}") # Output: Total messages in full history: 6 ``` We have a full history of 6 messages (1 system + 5 chat turns). **Step 2: Use `UnboundedChatCompletionContext`** This context keeps everything. ```python # File: use_unbounded_context.py import asyncio from define_chat_messages import full_history from autogen_core.model_context import UnboundedChatCompletionContext async def main(): # Create context and add all messages context = UnboundedChatCompletionContext() for msg in full_history: await context.add_message(msg) # Get the messages to send to the LLM messages_for_llm = await context.get_messages() print(f"--- Unbounded Context ({len(messages_for_llm)} messages) ---") for i, msg in enumerate(messages_for_llm): print(f"{i+1}. [{msg.type}]: {msg.content[:30]}...") # asyncio.run(main()) # If run ``` **Expected Output (Unbounded):** ``` --- Unbounded Context (6 messages) --- 1. [SystemMessage]: You are a helpful assistant.... 2. 
[UserMessage]: What is AutoGen?... 3. [AssistantMessage]: AutoGen is a multi-agent frame... 4. [UserMessage]: What can it do?... 5. [AssistantMessage]: It can build complex LLM apps.... 6. [UserMessage]: Thanks!... ``` It returns all 6 messages, exactly as added. **Step 3: Use `BufferedChatCompletionContext`** Let's keep only the last 3 messages. ```python # File: use_buffered_context.py import asyncio from define_chat_messages import full_history from autogen_core.model_context import BufferedChatCompletionContext async def main(): # Keep only the last 3 messages context = BufferedChatCompletionContext(buffer_size=3) for msg in full_history: await context.add_message(msg) messages_for_llm = await context.get_messages() print(f"--- Buffered Context (buffer=3, {len(messages_for_llm)} messages) ---") for i, msg in enumerate(messages_for_llm): print(f"{i+1}. [{msg.type}]: {msg.content[:30]}...") # asyncio.run(main()) # If run ``` **Expected Output (Buffered):** ``` --- Buffered Context (buffer=3, 3 messages) --- 1. [UserMessage]: What can it do?... 2. [AssistantMessage]: It can build complex LLM apps.... 3. [UserMessage]: Thanks!... ``` It only returns the last 3 messages from the full history. The system message and the first chat turn are omitted. **Step 4: Use `HeadAndTailChatCompletionContext`** Let's keep the first message (head=1) and the last two messages (tail=2). ```python # File: use_head_tail_context.py import asyncio from define_chat_messages import full_history from autogen_core.model_context import HeadAndTailChatCompletionContext async def main(): # Keep first 1 and last 2 messages context = HeadAndTailChatCompletionContext(head_size=1, tail_size=2) for msg in full_history: await context.add_message(msg) messages_for_llm = await context.get_messages() print(f"--- Head & Tail Context (h=1, t=2, {len(messages_for_llm)} messages) ---") for i, msg in enumerate(messages_for_llm): print(f"{i+1}.
[{msg.type}]: {msg.content[:30]}...") # asyncio.run(main()) # If run ``` **Expected Output (Head & Tail):** ``` --- Head & Tail Context (h=1, t=2, 4 messages) --- 1. [SystemMessage]: You are a helpful assistant.... 2. [UserMessage]: Skipped 3 messages.... 3. [AssistantMessage]: It can build complex LLM apps.... 4. [UserMessage]: Thanks!... ``` It keeps the very first message (`SystemMessage`), then inserts a placeholder telling the LLM that some messages were skipped, and finally includes the last two messages. This preserves the initial instruction and the most recent context. **Which one to choose?** It depends on your agent's task! * Simple Q&A? `Buffered` might be fine. * Following complex initial instructions? `HeadAndTail` or even `Unbounded` (if short) might be better. ## Under the Hood: How Context is Managed The core idea is defined by the `ChatCompletionContext` abstract base class. **Conceptual Flow:** ```mermaid sequenceDiagram participant Agent as Agent Logic participant Context as ChatCompletionContext participant FullHistory as Internal Message List Agent->>+Context: add_message(newMessage) Context->>+FullHistory: Append newMessage to list FullHistory-->>-Context: List updated Context-->>-Agent: Done Agent->>+Context: get_messages() Context->>+FullHistory: Read the full list FullHistory-->>-Context: Return full list Context->>Context: Apply Strategy (e.g., slice list for Buffered/HeadTail) Context-->>-Agent: Return selected list of messages ``` 1. **Adding:** When `add_message(message)` is called, the context simply appends the `message` to its internal list (`self._messages`). 2. **Getting:** When `get_messages()` is called: * The context accesses its internal `self._messages` list. * The specific implementation (`Unbounded`, `Buffered`, `HeadAndTail`) applies its logic to select which messages to return. * It returns the selected list. **Code Glimpse:** * **Base Class (`_chat_completion_context.py`):** Defines the structure and common methods.
```python # From: model_context/_chat_completion_context.py (Simplified) from abc import ABC, abstractmethod from typing import List from ..models import LLMMessage class ChatCompletionContext(ABC): component_type = "chat_completion_context" # Identifies this as a component type def __init__(self, initial_messages: List[LLMMessage] | None = None) -> None: # Holds the COMPLETE history self._messages: List[LLMMessage] = initial_messages or [] async def add_message(self, message: LLMMessage) -> None: """Add a message to the full context.""" self._messages.append(message) @abstractmethod async def get_messages(self) -> List[LLMMessage]: """Get the subset of messages based on the strategy.""" # Each subclass MUST implement this logic ... # Other methods like clear(), save_state(), load_state() exist too ``` The base class handles storing messages; subclasses define *how* to retrieve them. * **Unbounded (`_unbounded_chat_completion_context.py`):** The simplest implementation. ```python # From: model_context/_unbounded_chat_completion_context.py (Simplified) from typing import List from ._chat_completion_context import ChatCompletionContext from ..models import LLMMessage class UnboundedChatCompletionContext(ChatCompletionContext): async def get_messages(self) -> List[LLMMessage]: """Returns all messages.""" return self._messages # Just return the whole internal list ``` * **Buffered (`_buffered_chat_completion_context.py`):** Uses slicing to get the end of the list. ```python # From: model_context/_buffered_chat_completion_context.py (Simplified) from typing import List from ._chat_completion_context import ChatCompletionContext from ..models import LLMMessage, FunctionExecutionResultMessage class BufferedChatCompletionContext(ChatCompletionContext): def __init__(self, buffer_size: int, ...): super().__init__(...) 
            self._buffer_size = buffer_size

        async def get_messages(self) -> List[LLMMessage]:
            """Get at most `buffer_size` recent messages."""
            # Slice the list to get the last 'buffer_size' items
            messages = self._messages[-self._buffer_size :]
            # Special case: Avoid starting with a function result message
            if messages and isinstance(messages[0], FunctionExecutionResultMessage):
                messages = messages[1:]
            return messages
    ```

* **Head and Tail (`_head_and_tail_chat_completion_context.py`):** Combines slices from the beginning and end.

    ```python
    # From: model_context/_head_and_tail_chat_completion_context.py (Simplified)
    from typing import List
    from ._chat_completion_context import ChatCompletionContext
    from ..models import LLMMessage, UserMessage

    class HeadAndTailChatCompletionContext(ChatCompletionContext):
        def __init__(self, head_size: int, tail_size: int, ...):
            super().__init__(...)
            self._head_size = head_size
            self._tail_size = tail_size

        async def get_messages(self) -> List[LLMMessage]:
            head = self._messages[: self._head_size] # First 'head_size' items
            tail = self._messages[-self._tail_size :] # Last 'tail_size' items
            num_skipped = len(self._messages) - len(head) - len(tail)

            if num_skipped <= 0: # Head and tail already cover the whole history
                return self._messages
            else: # If messages were skipped
                placeholder = [UserMessage(content=f"Skipped {num_skipped} messages.", source="System")]
                # Combine head + placeholder + tail
                return head + placeholder + tail
    ```

These implementations provide different ways to manage the context window effectively.

## Putting it Together with ChatCompletionClient

How does an agent use `ChatCompletionContext` with the `ChatCompletionClient` from Chapter 5?

1. An agent has an instance of a `ChatCompletionContext` (e.g., `BufferedChatCompletionContext`) to store its conversation history.
2. When the agent receives a new message (e.g., a `UserMessage`), it calls `await context.add_message(new_user_message)`.
3.
   To prepare for calling the LLM, the agent calls `messages_to_send = await context.get_messages()`. This gets the strategically selected subset of the history.
4. The agent then passes this list to the `ChatCompletionClient`: `response = await llm_client.create(messages=messages_to_send, ...)`.
5. When the LLM replies (e.g., with an `AssistantMessage`), the agent adds it back to the context: `await context.add_message(llm_response_message)`.

This loop ensures that the history is continuously updated and intelligently trimmed before each call to the LLM.

## Next Steps

You've learned how `ChatCompletionContext` helps manage the conversation history sent to LLMs, preventing context window overflows and keeping the interaction focused using different strategies (`Unbounded`, `Buffered`, `HeadAndTail`).

This context management is a specific form of **memory**. Agents might need to remember things beyond just the chat history. How do they store general information, state, or knowledge over time?

* [Chapter 7: Memory](07_memory.md): Explore the broader concept of Memory in AutoGen Core, which provides more general ways for agents to store and retrieve information.
* [Chapter 8: Component](08_component.md): Understand how `ChatCompletionContext` fits into the general `Component` model, allowing configuration and integration within the AutoGen system.

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/AutoGen Core/07_memory.md
================================================
---
layout: default
title: "Memory"
parent: "AutoGen Core"
nav_order: 7
---

# Chapter 7: Memory - The Agent's Notebook

In [Chapter 6: ChatCompletionContext](06_chatcompletioncontext.md), we saw how agents manage the *short-term* history of a single conversation before talking to an LLM. It's like remembering what was just said in the last few minutes.
But what if an agent needs to remember things for much longer, across *multiple* conversations or tasks? For example, imagine an assistant agent that learns your preferences:
* You tell it: "Please always write emails in a formal style for me."
* Weeks later, you ask it to draft a new email. How does it remember that preference?

The short-term `ChatCompletionContext` might have forgotten the earlier instruction, especially if using a strategy like `BufferedChatCompletionContext`. The agent needs a **long-term memory**.

This is where the **`Memory`** abstraction comes in. Think of it as the agent's **long-term notebook or database**. While `ChatCompletionContext` is the scratchpad for the current chat, `Memory` holds persistent information the agent can add to or look up later.

## Motivation: Remembering Across Conversations

Our goal is to give an agent the ability to store a piece of information (like a user preference) and retrieve it later to influence its behavior, even in a completely new conversation. `Memory` provides the mechanism for this long-term storage and retrieval.

## Key Concepts: How the Notebook Works

1. **What it Stores (`MemoryContent`):** Agents can store various types of information in their memory. This could be:
    * Plain text notes (`text/plain`)
    * Structured data like JSON (`application/json`)
    * Even images (`image/*`)

    Each piece of information is wrapped in a `MemoryContent` object, which includes the data itself, its type (`mime_type`), and optional descriptive `metadata`.

    ```python
    # From: memory/_base_memory.py (Simplified Concept)
    from pydantic import BaseModel
    from typing import Any, Dict, Union

    # Represents one entry in the memory notebook
    class MemoryContent(BaseModel):
        content: Union[str, bytes, Dict[str, Any]] # The actual data
        mime_type: str # What kind of data (e.g., "text/plain")
        metadata: Dict[str, Any] | None = None # Extra info (optional)
    ```
    This standard format helps manage different kinds of memories.

2.
   **Adding to Memory (`add`):** When an agent learns something important it wants to remember long-term (like the user's preferred style), it uses the `memory.add(content)` method. This is like writing a new entry in the notebook.

3. **Querying Memory (`query`):** When an agent needs to recall information, it can use `memory.query(query_text)`. This is like searching the notebook for relevant entries. How the search works depends on the specific memory implementation (it could be a simple text match, or a sophisticated vector search in more advanced memories).

4. **Updating Chat Context (`update_context`):** This is a crucial link! Before an agent talks to the LLM (using the `ChatCompletionClient` from [Chapter 5](05_chatcompletionclient.md)), it can call the `memory.update_context(chat_context)` method. This method:
    * Looks at the current conversation (`chat_context`).
    * Queries the long-term memory (`Memory`) for relevant information.
    * Injects the retrieved memories *into* the `chat_context`, often as a `SystemMessage`.

    This way, the LLM gets the benefit of the long-term memory *in addition* to the short-term conversation history, right before generating its response.

5. **Different Memory Implementations:** Just like there are different `ChatCompletionContext` strategies, there can be different `Memory` implementations:
    * `ListMemory`: A very simple memory that stores everything in a Python list (like a simple chronological notebook).
    * *Future Possibilities*: More advanced implementations could use databases or vector stores for more efficient storage and retrieval of vast amounts of information.

## Use Case Example: Remembering User Preferences with `ListMemory`

Let's implement our user preference use case using the simple `ListMemory`.

**Goal:**
1. Create a `ListMemory`.
2. Add a user preference ("formal style") to it.
3. Start a *new* chat context.
4. Use `update_context` to inject the preference into the new chat context.
5.
   Show how the chat context looks *before* being sent to the LLM.

**Step 1: Create the Memory**

We'll use `ListMemory`, the simplest implementation provided by AutoGen Core.

```python
# File: create_list_memory.py
from autogen_core.memory import ListMemory

# Create a simple list-based memory instance
user_prefs_memory = ListMemory(name="user_preferences")

print(f"Created memory: {user_prefs_memory.name}")
print(f"Initial content: {user_prefs_memory.content}")
# Output:
# Created memory: user_preferences
# Initial content: []
```
We have an empty memory notebook named "user_preferences".

**Step 2: Add the Preference**

Let's add the user's preference as a piece of text memory.

```python
# File: add_preference.py
import asyncio
from autogen_core.memory import MemoryContent
# Assume user_prefs_memory exists from the previous step

# Define the preference as MemoryContent
preference = MemoryContent(
    content="User prefers all communication to be written in a formal style.",
    mime_type="text/plain", # It's just text
    metadata={"source": "user_instruction_conversation_1"} # Optional info
)

async def add_to_memory():
    # Add the content to our memory instance
    await user_prefs_memory.add(preference)
    print(f"Memory content after adding: {user_prefs_memory.content}")

asyncio.run(add_to_memory())
# Output (will show the MemoryContent object):
# Memory content after adding: [MemoryContent(content='User prefers...', mime_type='text/plain', metadata={'source': '...'})]
```
We've successfully written the preference into our `ListMemory` notebook.

**Step 3: Start a New Chat Context**

Imagine time passes, and the user starts a new conversation asking for an email draft. We create a fresh `ChatCompletionContext`.
```python
# File: start_new_chat.py
from autogen_core.model_context import UnboundedChatCompletionContext
from autogen_core.models import UserMessage

# Start a new, empty chat context for a new task
new_chat_context = UnboundedChatCompletionContext()

# Define the user's new request (it is added to the context in Step 4)
new_request = UserMessage(content="Draft an email to the team about the Q3 results.", source="User")
# await new_chat_context.add_message(new_request) # In a real app, add the request

print("Created a new, empty chat context.")
# Output: Created a new, empty chat context.
```
This context currently *doesn't* know about the "formal style" preference stored in our long-term memory.

**Step 4: Inject Memory into Chat Context**

Before sending the `new_chat_context` to the LLM, we use `update_context` to bring in relevant long-term memories.

```python
# File: update_chat_with_memory.py
import asyncio
# Assume user_prefs_memory exists (with the preference added)
# Assume new_chat_context exists (empty or with just the new request)
# Assume new_request exists

async def main():
    # --- This is where Memory connects to Chat Context ---
    print("Updating chat context with memory...")
    update_result = await user_prefs_memory.update_context(new_chat_context)
    print(f"Memories injected: {len(update_result.memories.results)}")

    # Now let's add the actual user request for this task
    await new_chat_context.add_message(new_request)

    # See what messages are now in the context
    messages_for_llm = await new_chat_context.get_messages()
    print("\nMessages to be sent to LLM:")
    for msg in messages_for_llm:
        print(f"- [{msg.type}]: {msg.content}")

asyncio.run(main())
```

**Expected Output:**

```
Updating chat context with memory...
Memories injected: 1

Messages to be sent to LLM:
- [SystemMessage]: Relevant memory content (in chronological order):
1. User prefers all communication to be written in a formal style.
- [UserMessage]: Draft an email to the team about the Q3 results.
```

Look!
The `ListMemory.update_context` method automatically queried the memory (in this simple case, it just takes *all* entries) and added a `SystemMessage` to the `new_chat_context`. This message explicitly tells the LLM about the stored preference *before* it sees the user's request to draft the email.

**Step 5: (Conceptual) Sending to LLM**

Now, if we were to send `messages_for_llm` to the `ChatCompletionClient` (Chapter 5):

```python
# Conceptual code - Requires a configured client
# response = await llm_client.create(messages=messages_for_llm)
```

The LLM would receive both the instruction about the formal style preference (from Memory) and the request to draft the email. It's much more likely to follow the preference now!

**Step 6: Direct Query (Optional)**

We can also directly query the memory if needed, without involving a chat context.

```python
# File: query_memory.py
import asyncio
# Assume user_prefs_memory exists

async def main():
    # Query the memory (ListMemory returns all items regardless of query text)
    query_result = await user_prefs_memory.query("style preference")
    print("\nDirect query result:")
    for item in query_result.results:
        print(f"- Content: {item.content}, Type: {item.mime_type}")

asyncio.run(main())
# Output:
# Direct query result:
# - Content: User prefers all communication to be written in a formal style., Type: text/plain
```
This shows how an agent could specifically look things up in its notebook.

## Under the Hood: How `ListMemory` Injects Context

Let's trace the `update_context` call for `ListMemory`.
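The same trace can be tried out without installing AutoGen. The block below is a minimal, dependency-free sketch, not the library's code: `SketchMessage`, `SketchContext`, and `SketchListMemory` are hypothetical stand-ins invented for illustration, and they mimic only the behaviour this chapter describes (store entries in a list, format them as a numbered block, inject one system-style message).

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchMessage:
    """Hypothetical stand-in for a chat message (not an AutoGen class)."""
    role: str
    content: str


@dataclass
class SketchContext:
    """Hypothetical stand-in for a chat context that keeps every message."""
    messages: List[SketchMessage] = field(default_factory=list)

    async def add_message(self, message: SketchMessage) -> None:
        self.messages.append(message)


@dataclass
class SketchListMemory:
    """Hypothetical stand-in mimicking the list-based injection described here."""
    contents: List[str] = field(default_factory=list)

    async def add(self, content: str) -> None:
        self.contents.append(content)

    async def update_context(self, context: SketchContext) -> List[str]:
        # Empty memory: leave the chat context untouched.
        if not self.contents:
            return []
        # Number the entries starting at 1 and join them into one text block.
        numbered = [f"{i}. {text}" for i, text in enumerate(self.contents, 1)]
        block = (
            "Relevant memory content (in chronological order):\n"
            + "\n".join(numbered)
            + "\n"
        )
        # Inject the block as a single system-style message.
        await context.add_message(SketchMessage(role="system", content=block))
        # Report which entries were injected.
        return list(self.contents)


async def demo() -> SketchContext:
    memory = SketchListMemory()
    await memory.add("User prefers all communication to be written in a formal style.")

    context = SketchContext()
    injected = await memory.update_context(context)
    # The user's request is added after the memory block, as in Step 4.
    await context.add_message(SketchMessage(role="user", content="Draft an email to the team."))

    print(f"Memories injected: {len(injected)}")
    for message in context.messages:
        print(f"[{message.role}] {message.content}")
    return context


demo_context = asyncio.run(demo())
```

Because the memory block is injected before the user's request is added, it ends up first in the context, which matches the message order shown in Step 4.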
**Conceptual Flow:**

```mermaid
sequenceDiagram
    participant AgentLogic as Agent Logic
    participant ListMem as ListMemory
    participant InternalList as Memory's Internal List
    participant ChatCtx as ChatCompletionContext

    AgentLogic->>+ListMem: update_context(chat_context)
    ListMem->>+InternalList: Get all stored MemoryContent items
    InternalList-->>-ListMem: Return list of [pref_content]
    alt Memory list is NOT empty
        ListMem->>ListMem: Format memories into a single string (e.g., "1. pref_content")
        ListMem->>ListMem: Create SystemMessage with formatted string
        ListMem->>+ChatCtx: add_message(SystemMessage)
        ChatCtx-->>-ListMem: Context updated
    end
    ListMem->>ListMem: Create UpdateContextResult(memories=[pref_content])
    ListMem-->>-AgentLogic: Return UpdateContextResult
```

1. The agent calls `user_prefs_memory.update_context(new_chat_context)`.
2. The `ListMemory` instance accesses its internal `_contents` list.
3. It checks if the list is empty. If not:
4. It iterates through the `MemoryContent` items in the list.
5. It formats them into a numbered string (like "Relevant memory content...\n1. Item 1\n2. Item 2...").
6. It creates a single `SystemMessage` containing this formatted string.
7. It calls `new_chat_context.add_message()` to add this `SystemMessage` to the chat history that will be sent to the LLM.
8. It returns an `UpdateContextResult` containing the list of memories it just processed.

**Code Glimpse:**

* **`Memory` Base Class (`memory/_base_memory.py`):** Defines the required methods for any memory implementation.

    ```python
    # From: memory/_base_memory.py (Simplified ABC)
    from abc import ABC, abstractmethod
    # ... other imports: MemoryContent, MemoryQueryResult, UpdateContextResult, ChatCompletionContext

    class Memory(ABC):
        component_type = "memory"

        @abstractmethod
        async def update_context(self, model_context: ChatCompletionContext) -> UpdateContextResult: ...

        @abstractmethod
        async def query(self, query: str | MemoryContent, ...) -> MemoryQueryResult: ...
        @abstractmethod
        async def add(self, content: MemoryContent, ...) -> None: ...

        @abstractmethod
        async def clear(self) -> None: ...

        @abstractmethod
        async def close(self) -> None: ...
    ```
    Any class wanting to act as Memory must provide these methods.

* **`ListMemory` Implementation (`memory/_list_memory.py`):**

    ```python
    # From: memory/_list_memory.py (Simplified)
    from typing import List
    # ... other imports: Memory, MemoryContent, ..., SystemMessage, ChatCompletionContext

    class ListMemory(Memory):
        def __init__(self, ..., memory_contents: List[MemoryContent] | None = None):
            # Stores memory items in a simple list
            self._contents: List[MemoryContent] = memory_contents or []

        async def add(self, content: MemoryContent, ...) -> None:
            """Add new content to the internal list."""
            self._contents.append(content)

        async def query(self, query: str | MemoryContent = "", ...) -> MemoryQueryResult:
            """Return all memories, ignoring the query."""
            # Simple implementation: just return everything
            return MemoryQueryResult(results=self._contents)

        async def update_context(self, model_context: ChatCompletionContext) -> UpdateContextResult:
            """Add all memories as a SystemMessage to the chat context."""
            if not self._contents: # Do nothing if memory is empty
                return UpdateContextResult(memories=MemoryQueryResult(results=[]))

            # Format all memories into a numbered list string
            memory_strings = [f"{i}. {str(mem.content)}" for i, mem in enumerate(self._contents, 1)]
            memory_context_str = "Relevant memory content...\n" + "\n".join(memory_strings) + "\n"

            # Add this string as a SystemMessage to the provided chat context
            await model_context.add_message(SystemMessage(content=memory_context_str))

            # Return info about which memories were added
            return UpdateContextResult(memories=MemoryQueryResult(results=self._contents))

        # ... clear(), close(), config methods ...
    ```
    This shows the straightforward logic of `ListMemory`: store in a list, retrieve the whole list, and inject the whole list as a single system message into the chat context. More complex memories might use smarter retrieval (e.g., based on the `query` in `query()` or the last message in `update_context`) and inject memories differently.

## Next Steps

You've learned about `Memory`, AutoGen Core's mechanism for giving agents long-term recall beyond the immediate conversation (`ChatCompletionContext`). We saw how `MemoryContent` holds information, `add` stores it, `query` retrieves it, and `update_context` injects relevant memories into the LLM's working context. We explored the simple `ListMemory` as a basic example.

Memory systems are crucial for agents that learn, adapt, or need to maintain state across interactions.

This concludes our deep dive into the core abstractions of AutoGen Core! We've covered Agents, Messaging, Runtime, Tools, LLM Clients, Chat Context, and now Memory. There's one final concept that ties many of these together from a configuration perspective:

* [Chapter 8: Component](08_component.md): Understand the general `Component` model in AutoGen Core, how it allows pieces like `Memory`, `ChatCompletionContext`, and `ChatCompletionClient` to be configured and managed consistently.

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/AutoGen Core/08_component.md
================================================
---
layout: default
title: "Component"
parent: "AutoGen Core"
nav_order: 8
---

# Chapter 8: Component - The Standardized Building Blocks

Welcome to Chapter 8! In our journey so far, we've met several key players in AutoGen Core:
* [Agents](01_agent.md): The workers.
* [Messaging System](02_messaging_system__topic___subscription_.md): How they communicate.
* [AgentRuntime](03_agentruntime.md): The manager.
* [Tools](04_tool.md): Their special skills.
* [ChatCompletionClient](05_chatcompletionclient.md): How they talk to LLMs.
* [ChatCompletionContext](06_chatcompletioncontext.md): How they remember recent chat history.
* [Memory](07_memory.md): How they remember things long-term.

Now, imagine you've built a fantastic agent system using these parts. You've configured a specific `ChatCompletionClient` to use OpenAI's `gpt-4o` model, and you've set up a `ListMemory` (from Chapter 7) to store user preferences. How do you save this exact setup so you can easily recreate it later, or share it with a friend? And what if you later want to swap out the `gpt-4o` client for a different one, like Anthropic's Claude, without rewriting your agent's core logic?

This is where the **`Component`** concept comes in. It provides a standard way to define, configure, save, and load these reusable building blocks.

## Motivation: Making Setups Portable and Swappable

Think of the parts we've used so far – `ChatCompletionClient`, `Memory`, `Tool` – like specialized **Lego bricks**. Each brick has a specific function (connecting to an LLM, remembering things, performing an action).

Wouldn't it be great if:
1. Each Lego brick had a standard way to describe its properties (like "Red 2x4 Brick")?
2. You could easily save the description of all the bricks used in your creation (your agent system)?
3. Someone else could take that description and automatically rebuild your exact creation?
4. You could easily swap a "Red 2x4 Brick" for a "Blue 2x4 Brick" without having to rebuild everything around it?

The `Component` abstraction in AutoGen Core provides exactly this! It makes your building blocks **configurable**, **savable**, **loadable**, and **swappable**.

## Key Concepts: Understanding Components

Let's break down what makes the Component system work:

1. **Component:** A class (like `ListMemory` or `OpenAIChatCompletionClient`) that is designed to be a standard, reusable building block.
   It performs a specific role within the AutoGen ecosystem. Many core classes inherit from `Component` or related base classes.

2. **Configuration (`Config`):** Every Component has specific settings. For example, an `OpenAIChatCompletionClient` needs an API key and a model name. A `ListMemory` might have a name. These settings are defined in a standard way, usually using a Pydantic `BaseModel` specific to that component type. This `Config` acts like the "specification sheet" for the component instance.

3. **Saving Settings (`_to_config` method):** A Component instance knows how to generate its *current* configuration. It has an internal method, `_to_config()`, that returns a `Config` object representing its settings. This is like asking a configured Lego brick, "What color and size are you?"

4. **Loading Settings (`_from_config` class method):** A Component *class* knows how to create a *new* instance of itself from a given configuration. It has a class method, `_from_config(config)`, that takes a `Config` object and builds a new, configured component instance. This is like having instructions: "Build a brick with this color and size."

5. **`ComponentModel` (The Box):** This is the standard package format used to save and load components. It's like the label and instructions on the Lego box. A `ComponentModel` contains:
    * `provider`: A string telling AutoGen *which* Python class to use (e.g., `"autogen_core.memory.ListMemory"`).
    * `config`: A dictionary holding the specific settings for this instance (the output of `_to_config()`).
    * `component_type`: The general role of the component (e.g., `"memory"`, `"model"`, `"tool"`).
    * Other metadata like `version`, `description`, `label`.
    ```python
    # From: _component_config.py (Conceptual Structure)
    from pydantic import BaseModel
    from typing import Dict, Any

    class ComponentModel(BaseModel):
        provider: str # Path to the class (e.g., "autogen_core.memory.ListMemory")
        config: Dict[str, Any] # The specific settings for this instance
        component_type: str | None = None # Role (e.g., "memory")
        # ... other fields like version, description, label ...
    ```
    This `ComponentModel` is what you typically save to a file (often as JSON or YAML).

## Use Case Example: Saving and Loading `ListMemory`

Let's see how this works with the `ListMemory` we used in [Chapter 7: Memory](07_memory.md).

**Goal:**
1. Create a `ListMemory` instance.
2. Save its configuration using the Component system (`dump_component`).
3. Load that configuration to create a *new*, identical `ListMemory` instance (`load_component`).

**Step 1: Create and Configure a `ListMemory`**

First, let's make a memory component. `ListMemory` is already designed as a Component.

```python
# File: create_memory_component.py
import asyncio
from autogen_core.memory import ListMemory, MemoryContent

# Create an instance of ListMemory
my_memory = ListMemory(name="user_prefs_v1")

# Add some content (from Chapter 7 example)
async def add_content():
    pref = MemoryContent(content="Use formal style", mime_type="text/plain")
    await my_memory.add(pref)
    print(f"Created memory '{my_memory.name}' with content: {my_memory.content}")

asyncio.run(add_content())
# Output: Created memory 'user_prefs_v1' with content: [MemoryContent(content='Use formal style', mime_type='text/plain', metadata=None)]
```
We have our configured `my_memory` instance.

**Step 2: Save the Configuration (`dump_component`)**

Now, let's ask this component instance to describe itself by creating a `ComponentModel`.
```python
# File: save_memory_config.py
# Assume 'my_memory' exists from the previous step

# Dump the component's configuration into a ComponentModel
memory_model = my_memory.dump_component()

# Let's print it (serializing to JSON for readability)
print("Saved ComponentModel:")
print(memory_model.model_dump_json(indent=2))
```

**Expected Output:**
```json
Saved ComponentModel:
{
  "provider": "autogen_core.memory.ListMemory",
  "component_type": "memory",
  "version": 1,
  "component_version": 1,
  "description": "ListMemory stores memory content in a simple list.",
  "label": "ListMemory",
  "config": {
    "name": "user_prefs_v1",
    "memory_contents": [
      {
        "content": "Use formal style",
        "mime_type": "text/plain",
        "metadata": null
      }
    ]
  }
}
```
Look at the output! `dump_component` created a `ComponentModel` that contains:
* `provider`: Exactly which class to use (`autogen_core.memory.ListMemory`).
* `config`: The specific settings, including the `name` and even the `memory_contents` we added!
* `component_type`: Its role is `"memory"`.
* Other useful info like description and version.

You could save this JSON structure to a file (`my_memory_config.json`).

**Step 3: Load the Configuration (`load_component`)**

Now, imagine you're starting a new script or sharing the config file. You can load this `ComponentModel` to recreate the memory instance.

```python
# File: load_memory_config.py
from autogen_core import ComponentModel
from autogen_core.memory import ListMemory # Need the class for type hint/loading

# Assume 'memory_model' is the ComponentModel we just created
# (or loaded from a file)

print(f"Loading component from ComponentModel (Provider: {memory_model.provider})...")

# Use the ComponentLoader mechanism (available on Component classes)
# to load the model. We specify the expected type (ListMemory).
loaded_memory: ListMemory = ListMemory.load_component(memory_model)

print("Successfully loaded memory!")
print(f"- Name: {loaded_memory.name}")
print(f"- Content: {loaded_memory.content}")
```

**Expected Output:**

```
Loading component from ComponentModel (Provider: autogen_core.memory.ListMemory)...
Successfully loaded memory!
- Name: user_prefs_v1
- Content: [MemoryContent(content='Use formal style', mime_type='text/plain', metadata=None)]
```

Success! `load_component` read the `ComponentModel`, found the right class (`ListMemory`), used its `_from_config` method with the saved `config` data, and created a brand new `loaded_memory` instance that is identical to our original `my_memory`.

**Benefits Shown:**

* **Reproducibility:** We saved the exact state (including content!) and loaded it perfectly.
* **Configuration:** We could easily save this to a JSON/YAML file and manage it outside our Python code.
* **Modularity (Conceptual):** If `ListMemory` and `VectorDBMemory` were both Components of type "memory", we could potentially load either one from a configuration file just by changing the `provider` and `config` in the file, without altering the agent code that *uses* the memory component (assuming the agent interacts via the standard `Memory` interface from Chapter 7).

## Under the Hood: How Saving and Loading Work

Let's peek behind the curtain.

**Saving (`dump_component`) Flow:**

```mermaid
sequenceDiagram
    participant User
    participant MyMemory as my_memory (ListMemory instance)
    participant ListMemConfig as ListMemoryConfig (Pydantic Model)
    participant CompModel as ComponentModel

    User->>+MyMemory: dump_component()
    MyMemory->>MyMemory: Calls internal self._to_config()
    MyMemory->>+ListMemConfig: Creates Config object (name="...", contents=[...])
    ListMemConfig-->>-MyMemory: Returns Config object
    MyMemory->>MyMemory: Gets provider string ("autogen_core.memory.ListMemory")
    MyMemory->>MyMemory: Gets component_type ("memory"), version, etc.
    MyMemory->>+CompModel: Creates ComponentModel(provider=..., config=config_dict, ...)
    CompModel-->>-MyMemory: Returns ComponentModel instance
    MyMemory-->>-User: Returns ComponentModel instance
```

1. You call `my_memory.dump_component()`.
2. It calls its own `_to_config()` method. For `ListMemory`, this gathers the `name` and current `_contents`.
3. `_to_config()` returns a `ListMemoryConfig` object (a Pydantic model) holding these values.
4. `dump_component()` takes this `ListMemoryConfig` object and converts its data into a dictionary (the `config` field).
5. It figures out its own class path (`provider`) and other metadata (`component_type`, `version`, etc.).
6. It packages all this into a `ComponentModel` object and returns it.

**Loading (`load_component`) Flow:**

```mermaid
sequenceDiagram
    participant User
    participant Loader as ComponentLoader (e.g., ListMemory.load_component)
    participant Importer as Python Import System
    participant ListMemClass as ListMemory (Class definition)
    participant ListMemConfig as ListMemoryConfig (Pydantic Model)
    participant NewMemory as New ListMemory Instance

    User->>+Loader: load_component(component_model)
    Loader->>Loader: Reads provider ("autogen_core.memory.ListMemory") from model
    Loader->>+Importer: Imports the class `autogen_core.memory.ListMemory`
    Importer-->>-Loader: Returns ListMemory class object
    Loader->>+ListMemClass: Checks if it's a valid Component class
    Loader->>ListMemClass: Gets expected config schema (ListMemoryConfig)
    Loader->>+ListMemConfig: Validates `config` dict from model against schema
    ListMemConfig-->>-Loader: Returns validated ListMemoryConfig object
    Loader->>+ListMemClass: Calls _from_config(validated_config)
    ListMemClass->>+NewMemory: Creates new ListMemory instance using config
    NewMemory-->>-ListMemClass: Returns new instance
    ListMemClass-->>-Loader: Returns new instance
    Loader-->>-User: Returns the new ListMemory instance
```

1. You call `ListMemory.load_component(memory_model)`.
2.
   The loader reads the `provider` string from `memory_model`.
3. It dynamically imports the class specified by `provider`.
4. It verifies this class is a proper `Component` subclass.
5. It finds the configuration schema defined by the class (e.g., `ListMemoryConfig`).
6. It validates the `config` dictionary from `memory_model` using this schema.
7. It calls the class's `_from_config()` method, passing the validated configuration object.
8. `_from_config()` uses the configuration data to initialize and return a new instance of the class (e.g., a new `ListMemory` with the loaded name and content).
9. The loader returns this newly created instance.

**Code Glimpse:**

The core logic lives in `_component_config.py`.

* **`Component` Base Class:** Classes like `ListMemory` inherit from `Component`. This requires them to define `component_type`, `component_config_schema`, and implement `_to_config()` and `_from_config()`.

    ```python
    # From: _component_config.py (Simplified Concept)
    from pydantic import BaseModel
    from typing import Any, ClassVar, Dict, Generic, Type, TypeVar
    from typing_extensions import Self
    # ... other imports

    ConfigT = TypeVar("ConfigT", bound=BaseModel)

    class Component(Generic[ConfigT]): # Generic over its config type
        # Required Class Variables for Concrete Components
        component_type: ClassVar[str]
        component_config_schema: Type[ConfigT]

        # Required Instance Method for Saving
        def _to_config(self) -> ConfigT:
            raise NotImplementedError

        # Required Class Method for Loading
        @classmethod
        def _from_config(cls, config: ConfigT) -> Self:
            raise NotImplementedError

        # dump_component and load_component are also part of the system
        # (often inherited from base classes like ComponentBase)
        def dump_component(self) -> ComponentModel: ...

        @classmethod
        def load_component(cls, model: ComponentModel | Dict[str, Any]) -> Self: ...
    ```

* **`ComponentModel`:** As shown before, a Pydantic model to hold the `provider`, `config`, `component_type`, etc.
* **`dump_component` Implementation (Conceptual):**

    ```python
    # Inside ComponentBase or similar
    def dump_component(self) -> ComponentModel:
        # 1. Get the specific config from the instance
        obj_config: BaseModel = self._to_config()
        config_dict = obj_config.model_dump()  # Convert to dictionary

        # 2. Determine the provider string (class path)
        provider_str = _type_to_provider_str(self.__class__)
        # (Handle overrides like self.component_provider_override)

        # 3. Get other metadata
        comp_type = self.component_type
        comp_version = self.component_version
        # ... description, label ...

        # 4. Create and return the ComponentModel
        model = ComponentModel(
            provider=provider_str,
            config=config_dict,
            component_type=comp_type,
            version=comp_version,
            # ... other metadata ...
        )
        return model
    ```

* **`load_component` Implementation (Conceptual):**

    ```python
    import importlib  # Module-level import needed for the dynamic import below

    # Inside ComponentLoader or similar
    @classmethod
    def load_component(cls, model: ComponentModel | Dict[str, Any]) -> Self:
        # 1. Ensure we have a ComponentModel object
        if isinstance(model, dict):
            loaded_model = ComponentModel(**model)
        else:
            loaded_model = model

        # 2. Import the class based on the provider string
        provider_str = loaded_model.provider
        # ... (handle WELL_KNOWN_PROVIDERS mapping) ...
        module_path, class_name = provider_str.rsplit(".", 1)
        module = importlib.import_module(module_path)
        component_class = getattr(module, class_name)

        # 3. Validate the class and config
        if not is_component_class(component_class):  # Check it's a valid Component
            raise TypeError(...)
        schema = component_class.component_config_schema
        validated_config = schema.model_validate(loaded_model.config)

        # 4. Call the class's factory method to create instance
        instance = component_class._from_config(validated_config)

        # 5. Return the instance (after type checks)
        return instance
    ```

This system provides a powerful and consistent way to manage the building blocks of your AutoGen applications.

## Wrapping Up

Congratulations! You've reached the end of our core concepts tour.
You now understand the `Component` model: AutoGen Core's standard way to define configurable, savable, and loadable building blocks like `Memory`, `ChatCompletionClient`, `Tool`, and even aspects of `Agents` themselves.

* **Components** are like standardized Lego bricks.
* They use **`_to_config`** to describe their settings.
* They use **`_from_config`** to be built from settings.
* **`ComponentModel`** is the standard "box" storing the provider and config, enabling saving/loading (often via JSON/YAML).

This promotes:

* **Modularity:** Easily swap implementations (e.g., different LLM clients).
* **Reproducibility:** Save and load exact agent system configurations.
* **Configuration:** Manage settings in external files.

With these eight core concepts (`Agent`, `Messaging`, `AgentRuntime`, `Tool`, `ChatCompletionClient`, `ChatCompletionContext`, `Memory`, and `Component`), you have a solid foundation for understanding and building powerful multi-agent applications with AutoGen Core!

Happy building!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/AutoGen Core/index.md
================================================
---
layout: default
title: "AutoGen Core"
nav_order: 3
has_children: true
---

# Tutorial: AutoGen Core

> This tutorial is AI-generated! To learn more, check out [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

AutoGen Core ([View Repo](https://github.com/microsoft/autogen/tree/e45a15766746d95f8cfaaa705b0371267bec812e/python/packages/autogen-core/src/autogen_core)) helps you build applications with multiple **_Agents_** that can work together.
Think of it like creating a team of specialized workers (*Agents*) who can communicate and use tools to solve problems.
The **_AgentRuntime_** acts as the manager, handling messages and agent lifecycles.
Agents communicate using a **_Messaging System_** (Topics and Subscriptions), can use **_Tools_** for specific tasks, interact with language models via a **_ChatCompletionClient_** while managing conversation history with **_ChatCompletionContext_**, and remember information using **_Memory_**.
**_Components_** provide a standard way to define and configure these building blocks.

```mermaid
flowchart TD
    A0["0: Agent"]
    A1["1: AgentRuntime"]
    A2["2: Messaging System (Topic & Subscription)"]
    A3["3: Component"]
    A4["4: Tool"]
    A5["5: ChatCompletionClient"]
    A6["6: ChatCompletionContext"]
    A7["7: Memory"]
    A1 -- "Manages lifecycle" --> A0
    A1 -- "Uses for message routing" --> A2
    A0 -- "Uses LLM client" --> A5
    A0 -- "Executes tools" --> A4
    A0 -- "Accesses memory" --> A7
    A5 -- "Gets history from" --> A6
    A5 -- "Uses tool schema" --> A4
    A7 -- "Updates LLM context" --> A6
    A4 -- "Implemented as" --> A3
```

================================================
FILE: docs/Browser Use/01_agent.md
================================================
---
layout: default
title: "Agent"
parent: "Browser Use"
nav_order: 1
---

# Chapter 1: The Agent - Your Browser Assistant's Brain

Welcome to the `Browser Use` tutorial! We're excited to help you learn how to automate web tasks using the power of Large Language Models (LLMs).

Imagine you want to perform a simple task, like searching Google for "cute cat pictures" and clicking on the very first image result. For a human, this is easy! You open your browser, type in the search, look at the results, and click.

But how do you tell a computer program to do this? It needs to understand the goal, look at the webpage like a human does, decide what to click or type next, and then actually perform those actions. This is where the **Agent** comes in.

## What Problem Does the Agent Solve?

The Agent is the core orchestrator, the "brain" or "project manager" of your browser automation task. It connects all the different pieces needed to achieve your goal.
Without the Agent, you'd have a bunch of tools (like a browser controller and an LLM) but no central coordinator telling them what to do and when.

The Agent solves the problem of turning a high-level goal (like "find cat pictures") into concrete actions on a webpage, using intelligence to adapt to what it "sees" in the browser.

## Meet the Agent: Your Project Manager

Think of the `Agent` like a project manager overseeing a complex task. It doesn't do *all* the work itself, but it coordinates specialists:

1. **Receives the Task:** You give the Agent the overall goal (e.g., "Search Google for 'cute cat pictures' and click the first image result.").
2. **Consults the Planner (LLM):** The Agent shows the current state of the webpage (using the [BrowserContext](03_browsercontext.md)) to a Large Language Model (LLM). It asks, "Here's the goal, and here's what the webpage looks like right now. What should be the very next step?" The LLM acts as a smart planner, suggesting actions like "type 'cute cat pictures' into the search bar" or "click the element with index 5". We'll learn more about how we instruct the LLM in the [System Prompt](02_system_prompt.md) chapter.
3. **Manages History:** The Agent keeps track of everything that has happened so far: the actions taken, the results, and the state of the browser at each step. This "memory" is managed by the [Message Manager](06_message_manager.md) and helps the LLM make better decisions.
4. **Instructs the Doer (Controller):** Once the LLM suggests an action (like "click element 5"), the Agent tells the [Action Controller & Registry](05_action_controller___registry.md) to actually perform that specific action within the browser.
5. **Observes the Results (BrowserContext):** After the Controller acts, the Agent uses the [BrowserContext](03_browsercontext.md) again to see the new state of the webpage (e.g., the Google search results page).
6.
**Repeats:** The Agent repeats steps 2-5, continuously consulting the LLM, instructing the Controller, and observing the results, until the original task is complete or it reaches a stopping point.

## Using the Agent: A Simple Example

Let's see how you might use the Agent in Python code. Don't worry about understanding every detail yet; focus on the main idea. We're setting up the Agent with our task and the necessary components.

```python
# --- Simplified Example ---
# We need to import the necessary parts from the browser_use library
# (BrowserContext is included because main() below uses it directly)
from browser_use import Agent, Browser, BrowserContext, Controller, BrowserConfig, BrowserContextConfig
# Assume 'my_llm' is your configured Large Language Model (e.g., from OpenAI, Anthropic)
from my_llm_setup import my_llm # Placeholder for your specific LLM setup

# 1. Define the task for the Agent
my_task = "Go to google.com, search for 'cute cat pictures', and click the first image result."

# 2. Basic browser configuration (we'll learn more later)
browser_config = BrowserConfig() # Default settings
context_config = BrowserContextConfig() # Default settings

# 3. Initialize the components the Agent needs
# The Browser manages the underlying browser application
browser = Browser(config=browser_config)
# The Controller knows *how* to perform actions like 'click' or 'type'
controller = Controller()

async def main():
    # The BrowserContext represents a single browser tab/window environment
    # It uses the Browser and its configuration
    async with BrowserContext(browser=browser, config=context_config) as browser_context:

        # 4. Create the Agent instance!
        agent = Agent(
            task=my_task,
            llm=my_llm, # The "brain" - the Language Model
            browser_context=browser_context, # The "eyes" - interacts with the browser tab
            controller=controller # The "hands" - executes actions
            # Many other settings can be configured here!
        )

        print(f"Agent created. Starting task: {my_task}")

        # 5. Run the Agent! This starts the loop.
# It will keep taking steps until the task is done or it hits the limit. history = await agent.run(max_steps=15) # Limit steps for safety # 6. Check the result if history.is_done() and history.is_successful(): print("✅ Agent finished the task successfully!") print(f"Final message from agent: {history.final_result()}") else: print("⚠️ Agent stopped. Maybe max_steps reached or task wasn't completed successfully.") # The 'async with' block automatically cleans up the browser_context await browser.close() # Close the browser application # Run the asynchronous function import asyncio asyncio.run(main()) ``` **What happens when you run this?** 1. An `Agent` object is created with your task, the LLM, the browser context, and the controller. 2. Calling `agent.run(max_steps=15)` starts the main loop. 3. The Agent gets the initial state of the browser (likely a blank page). 4. It asks the LLM what to do. The LLM might say "Go to google.com". 5. The Agent tells the Controller to execute the "go to URL" action. 6. The browser navigates to Google. 7. The Agent gets the new state (Google's homepage). 8. It asks the LLM again. The LLM says "Type 'cute cat pictures' into the search bar". 9. The Agent tells the Controller to type the text. 10. This continues step-by-step: pressing Enter, seeing results, asking the LLM, clicking the image. 11. Eventually, the LLM will hopefully tell the Agent the task is "done". 12. `agent.run()` finishes and returns the `history` object containing details of what happened. ## How it Works Under the Hood: The Agent Loop Let's visualize the process with a simple diagram: ```mermaid sequenceDiagram participant User participant Agent participant LLM participant Controller participant BC as BrowserContext User->>Agent: Start task("Search Google for cats...") Note over Agent: Agent Loop Starts Agent->>BC: Get current state (e.g., blank page) BC-->>Agent: Current Page State Agent->>LLM: What's next? 
(Task + State + History) LLM-->>Agent: Plan: [Action: Type 'cute cat pictures', Action: Press Enter] Agent->>Controller: Execute: type_text(...) Controller->>BC: Perform type action Agent->>Controller: Execute: press_keys('Enter') Controller->>BC: Perform press action Agent->>BC: Get new state (search results page) BC-->>Agent: New Page State Agent->>LLM: What's next? (Task + New State + History) LLM-->>Agent: Plan: [Action: click_element(index=5)] Agent->>Controller: Execute: click_element(index=5) Controller->>BC: Perform click action Note over Agent: Loop continues until done... LLM-->>Agent: Plan: [Action: done(success=True, text='Found cat picture!')] Agent->>Controller: Execute: done(...) Controller-->>Agent: ActionResult (is_done=True) Note over Agent: Agent Loop Ends Agent->>User: Return History (Task Complete) ``` The core of the `Agent` lives in the `agent/service.py` file. The `Agent` class manages the overall process. 1. **Initialization (`__init__`)**: When you create an `Agent`, it sets up its internal state, stores the task, the LLM, the controller, and prepares the [Message Manager](06_message_manager.md) to keep track of the conversation history. It also figures out the best way to talk to the specific LLM you provided. ```python # --- File: agent/service.py (Simplified __init__) --- class Agent: def __init__( self, task: str, llm: BaseChatModel, browser_context: BrowserContext, controller: Controller, # ... other settings like use_vision, max_failures, etc. **kwargs ): self.task = task self.llm = llm self.browser_context = browser_context self.controller = controller self.settings = AgentSettings(**kwargs) # Store various settings self.state = AgentState() # Internal state (step count, failures, etc.) # Setup message manager for history, using the task and system prompt self._message_manager = MessageManager( task=self.task, system_message=self.settings.system_prompt_class(...).get_system_message(), settings=MessageManagerSettings(...) # ... 
more setup ... ) # ... other initializations ... logger.info("Agent initialized.") ``` 2. **Running the Task (`run`)**: The `run` method orchestrates the main loop. It calls the `step` method repeatedly until the task is marked as done, an error occurs, or `max_steps` is reached. ```python # --- File: agent/service.py (Simplified run method) --- class Agent: # ... (init) ... async def run(self, max_steps: int = 100) -> AgentHistoryList: self._log_agent_run() # Log start event try: for step_num in range(max_steps): if self.state.stopped or self.state.consecutive_failures >= self.settings.max_failures: break # Stop conditions # Wait if paused while self.state.paused: await asyncio.sleep(0.2) step_info = AgentStepInfo(step_number=step_num, max_steps=max_steps) await self.step(step_info) # <<< Execute one step of the loop if self.state.history.is_done(): await self.log_completion() # Log success/failure break # Exit loop if agent signaled 'done' else: logger.info("Max steps reached.") # Ran out of steps finally: # ... (cleanup, telemetry, potentially save history/gif) ... pass return self.state.history # Return the recorded history ``` 3. **Taking a Step (`step`)**: This is the heart of the loop. In each step, the Agent: * Gets the current browser state (`browser_context.get_state()`). * Adds this state to the history via the `_message_manager`. * Asks the LLM for the next action (`get_next_action()`). * Tells the `Controller` to execute the action(s) (`multi_act()`). * Records the outcome in the history. * Handles any errors that might occur. ```python # --- File: agent/service.py (Simplified step method) --- class Agent: # ... (init, run) ... async def step(self, step_info: Optional[AgentStepInfo] = None) -> None: logger.info(f"📍 Step {self.state.n_steps}") state = None model_output = None result: list[ActionResult] = [] try: # 1. Get current state from the browser state = await self.browser_context.get_state() # Uses BrowserContext # 2. 
Add state (+ previous result) to message history for LLM context self._message_manager.add_state_message(state, self.state.last_result, ...) # 3. Get LLM's decision on the next action(s) input_messages = self._message_manager.get_messages() model_output = await self.get_next_action(input_messages) # Calls the LLM self.state.n_steps += 1 # Increment step counter # 4. Execute the action(s) using the Controller result = await self.multi_act(model_output.action) # Uses Controller self.state.last_result = result # Store result for next step's context # 5. Record step details (actions, results, state snapshot) self._make_history_item(model_output, state, result, ...) self.state.consecutive_failures = 0 # Reset failure count on success except Exception as e: # Handle errors, increment failure count, maybe retry later result = await self._handle_step_error(e) self.state.last_result = result # ... (finally block for logging/telemetry) ... ``` ## Conclusion You've now met the `Agent`, the central coordinator in `Browser Use`. You learned that it acts like a project manager, taking your high-level task, consulting an LLM for step-by-step planning, managing the history, and instructing a `Controller` to perform actions within a `BrowserContext`. The Agent's effectiveness heavily relies on how well we instruct the LLM planner. In the next chapter, we'll dive into exactly that: crafting the **System Prompt** to guide the LLM's behavior. 
[Next Chapter: System Prompt](02_system_prompt.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Browser Use/02_system_prompt.md ================================================ --- layout: default title: "System Prompt" parent: "Browser Use" nav_order: 2 --- # Chapter 2: The System Prompt - Setting the Rules for Your AI Assistant In [Chapter 1: The Agent](01_agent.md), we met the `Agent`, our project manager for automating browser tasks. We saw it consults a Large Language Model (LLM) – the "planner" – to decide the next steps based on the current state of the webpage. But how does the Agent tell the LLM *how* it should think, behave, and respond? Just giving it the task isn't enough! Imagine hiring a new assistant. You wouldn't just say, "Organize my files!" You'd give them specific instructions: "Please sort the files alphabetically by client name, put them in the blue folders, and give me a summary list when you're done." Without these rules, the assistant might do something completely different! The **System Prompt** solves this exact problem for our LLM. It's the set of core instructions and rules we give the LLM at the very beginning, telling it exactly how to act as a browser automation assistant and, crucially, how to format its responses so the `Agent` can understand them. ## What is the System Prompt? The AI's Rulebook Think of the System Prompt like the AI assistant's fundamental operating manual, its "Prime Directive," or the rules of a board game. It defines: 1. **Persona:** "You are an AI agent designed to automate browser tasks." 2. **Goal:** "Your goal is to accomplish the ultimate task..." 3. **Input:** How to understand the information it receives about the webpage ([DOM Representation](04_dom_representation.md)). 4. 
**Capabilities:** What actions it can take ([Action Controller & Registry](05_action_controller___registry.md)).
5. **Limitations:** What it *shouldn't* do (e.g., hallucinate actions).
6. **Response Format:** The *exact* structure (JSON format) its thoughts and planned actions must follow.

Without this rulebook, the LLM might just chat casually, give vague suggestions, or produce output in a format the `Agent` code can't parse. The System Prompt ensures the LLM behaves like the specialized tool we need.

## Why is the Response Format So Important?

This is a critical point. The `Agent` code isn't a human reading the LLM's response. It's a program expecting data in a very specific structure. The System Prompt tells the LLM to *always* respond in a JSON format that looks something like this (simplified):

```json
{
  "current_state": {
    "evaluation_previous_goal": "Success - Found the search bar.",
    "memory": "On google.com main page. Need to search for cats.",
    "next_goal": "Type 'cute cat pictures' into the search bar."
  },
  "action": [
    {
      "input_text": {
        "index": 5,
        "text": "cute cat pictures"
      }
    },
    {
      "press_keys": {
        "keys": "Enter"
      }
    }
  ]
}
```

In this example, `index` 5 is the index of the search bar element, and `press_keys` with `"Enter"` presses the Enter key. (Strict JSON has no comment syntax, so these notes live outside the block.)

The `Agent` can easily read this JSON:

* It understands the LLM's thoughts (`current_state`).
* It sees the exact `action` list the LLM wants to perform.
* It passes these actions (like `input_text` or `press_keys`) to the [Action Controller & Registry](05_action_controller___registry.md) to execute them in the browser.

If the LLM responded with just "Okay, I'll type 'cute cat pictures' into the search bar and press Enter," the `Agent` wouldn't know *which* element index corresponds to the search bar or exactly which actions to call. The strict JSON format is essential for automation.

## A Peek Inside the Rulebook (`system_prompt.md`)

The actual instructions live in a text file within the `Browser Use` library: `browser_use/agent/system_prompt.md`.
It's quite detailed, but here's a tiny snippet focusing on the response format rule:

```markdown
# Response Rules
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
{{"current_state": {{"evaluation_previous_goal": "...", "memory": "...", "next_goal": "..."}},
"action":[{{"one_action_name": {{...}}}}, ...]}}

2. ACTIONS: You can specify multiple actions in the list... Use maximum {max_actions} actions...
```

*(This is heavily simplified! The real file has many more rules about element interaction, error handling, task completion, etc.)*

This file clearly defines the JSON structure (`current_state` and `action`) and other crucial behaviors required from the LLM. The doubled braces around the JSON are escapes for Python string formatting; the single-braced `{max_actions}` is a placeholder that gets filled in with a real value.

## How the Agent Uses the System Prompt

The `Agent` uses a helper class called `SystemPrompt` (found in `agent/prompts.py`) to manage these rules. Here's the flow:

1. **Loading:** When you create an `Agent`, it internally creates a `SystemPrompt` object. This object reads the rules from the `system_prompt.md` file.
2. **Formatting:** The `SystemPrompt` object formats these rules into a special `SystemMessage` object that LLMs understand as foundational instructions.
3. **Conversation Start:** This `SystemMessage` is given to the [Message Manager](06_message_manager.md), which keeps track of the conversation history with the LLM. The `SystemMessage` becomes the *very first message*, setting the context for all future interactions in that session.

Think of it like starting a meeting: the first thing you do is state the agenda and rules (System Prompt), and then the discussion (LLM interaction) follows based on that foundation.

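
The brace handling in the snippet above is easy to get wrong, so here is a minimal sketch of how Python's built-in `str.format` treats it. The template text and the `render` helper are made up for illustration and are not the contents of the real `system_prompt.md`; the only assumption is that the rules file is filled in with `str.format`.

```python
# A hypothetical mini-template for illustration only (not the real system_prompt.md).
# Doubled braces are escapes; the single-braced name is a real placeholder.
template = (
    'Respond with JSON like {{"action": [{{"one_action_name": {{}}}}]}}. '
    "Use at most {max_actions} actions per step."
)


def render(template_text: str, max_actions: int) -> str:
    """Fill the single-braced placeholder; doubled braces become literal braces."""
    return template_text.format(max_actions=max_actions)


rendered = render(template, max_actions=10)
print(rendered)

# A doubled-brace "placeholder" is NOT substituted; it only loses one layer of braces.
unfilled = "{{max_actions}}".format(max_actions=10)
print(unfilled)
```

The practical consequence: any literal JSON braces in a prompt template need doubling, while names that should be replaced at runtime must use single braces.
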
Let's look at a simplified view of the `SystemPrompt` class loading the rules:

```python
# --- File: agent/prompts.py (Simplified) ---
import importlib.resources # Helps find files within the installed library
from langchain_core.messages import SystemMessage # Special message type for LLMs

class SystemPrompt:
    def __init__(self, action_description: str, max_actions_per_step: int = 10):
        # We ignore these details for now
        self.default_action_description = action_description
        self.max_actions_per_step = max_actions_per_step
        self._load_prompt_template() # <--- Loads the rules file

    def _load_prompt_template(self) -> None:
        """Load the prompt rules from the system_prompt.md file."""
        try:
            # Finds the 'system_prompt.md' file inside the browser_use package
            filepath = importlib.resources.files('browser_use.agent').joinpath('system_prompt.md')
            with filepath.open('r') as f:
                self.prompt_template = f.read() # Read the text content
            print("System Prompt template loaded successfully!")
        except Exception as e:
            print(f"Error loading system prompt: {e}")
            self.prompt_template = "Error: Could not load prompt." # Fallback

    def get_system_message(self) -> SystemMessage:
        """Format the loaded rules into a message for the LLM."""
        # Fill single-braced placeholders like {max_actions} with actual values;
        # doubled braces in the template come out as literal braces
        prompt = self.prompt_template.format(max_actions=self.max_actions_per_step)
        # Wrap the final rules text in a SystemMessage object
        return SystemMessage(content=prompt)

# --- How it plugs into Agent creation (Conceptual) ---
# from browser_use import Agent, SystemPrompt
# from my_llm_setup import my_llm # Your LLM
# ... other setup ...
# When you create an Agent: # agent = Agent( # task="Find cat pictures", # llm=my_llm, # browser_context=..., # controller=..., # # The Agent's __init__ method does something like this internally: # # system_prompt_obj = SystemPrompt(action_description="...", max_actions_per_step=10) # # system_message_for_llm = system_prompt_obj.get_system_message() # # This system_message_for_llm is then passed to the Message Manager. # ) ``` This code shows how the `SystemPrompt` class finds and reads the `system_prompt.md` file and prepares the instructions as a `SystemMessage` ready for the LLM conversation. ## Under the Hood: Initialization and Conversation Flow Let's visualize how the System Prompt fits into the Agent's setup and interaction loop: ```mermaid sequenceDiagram participant User participant Agent_Init as Agent Initialization participant SP as SystemPrompt Class participant MM as Message Manager participant Agent_Run as Agent Run Loop participant LLM User->>Agent_Init: Create Agent(task, llm, ...) Note over Agent_Init: Agent needs the rules! Agent_Init->>SP: Create SystemPrompt(...) SP->>SP: _load_prompt_template() reads system_prompt.md SP-->>Agent_Init: SystemPrompt instance Agent_Init->>SP: get_system_message() SP-->>Agent_Init: system_message (The Formatted Rules) Note over Agent_Init: Pass rules to conversation manager Agent_Init->>MM: Initialize MessageManager(task, system_message) MM->>MM: Store system_message as message #1 MM-->>Agent_Init: MessageManager instance ready Agent_Init-->>User: Agent created and ready User->>Agent_Run: agent.run() starts the task Note over Agent_Run: Agent needs context for LLM Agent_Run->>MM: get_messages() MM-->>Agent_Run: [system_message, user_message(state), ...] Note over Agent_Run: Send rules + current state to LLM Agent_Run->>LLM: Ask for next action (Input includes rules) LLM-->>Agent_Run: JSON response (LLM followed rules!) Agent_Run->>MM: add_model_output(...) Note over Agent_Run: Loop continues... 
``` Internally, the `Agent`'s initialization code (`__init__` in `agent/service.py`) explicitly creates the `SystemPrompt` and passes its output to the `MessageManager`: ```python # --- File: agent/service.py (Simplified Agent __init__) --- # ... other imports ... from browser_use.agent.prompts import SystemPrompt # Import the class from browser_use.agent.message_manager.service import MessageManager, MessageManagerSettings class Agent: def __init__( self, task: str, llm: BaseChatModel, browser_context: BrowserContext, controller: Controller, system_prompt_class: Type[SystemPrompt] = SystemPrompt, # Allows customizing the prompt class max_actions_per_step: int = 10, # ... other parameters ... **kwargs ): self.task = task self.llm = llm # ... store other components ... # Get the list of available actions from the controller self.available_actions = controller.registry.get_prompt_description() # 1. Create the SystemPrompt instance using the provided class system_prompt_instance = system_prompt_class( action_description=self.available_actions, max_actions_per_step=max_actions_per_step, ) # 2. Get the formatted SystemMessage (the rules) system_message = system_prompt_instance.get_system_message() # 3. Initialize the Message Manager with the task and the rules self._message_manager = MessageManager( task=self.task, system_message=system_message, # <--- Pass the rules here! settings=MessageManagerSettings(...) # ... other message manager setup ... ) # ... rest of initialization ... logger.info("Agent initialized with System Prompt.") ``` When the `Agent` runs its loop (`agent.run()` calls `agent.step()`), it asks the `MessageManager` for the current conversation history (`self._message_manager.get_messages()`). The `MessageManager` always ensures that the `SystemMessage` (containing the rules) is the very first item in that history list sent to the LLM. 
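
To make the "rules always come first" idea concrete, here is a small, self-contained sketch. `Message` and `MiniHistory` are hypothetical stand-ins written for this example, not classes from `browser_use`, and the trimming rule (drop the oldest turns once a limit is reached) is an assumption chosen only to show that the system message stays pinned.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


@dataclass
class MiniHistory:
    """Hypothetical stand-in for a message manager (not the library's class)."""
    system_message: Message
    max_messages: int = 4                      # total budget, system message included
    _turns: list = field(default_factory=list)

    def add(self, message: Message) -> None:
        self._turns.append(message)
        # Drop the oldest turns, always leaving one slot for the system message
        overflow = len(self._turns) - (self.max_messages - 1)
        if overflow > 0:
            del self._turns[:overflow]

    def get_messages(self) -> list:
        # The rules are pinned at index 0, followed by the remaining turns
        return [self.system_message, *self._turns]


history = MiniHistory(system_message=Message("system", "You are a browser agent..."))
for i in range(5):
    history.add(Message("user", f"browser state {i}"))

messages = history.get_messages()
print([m.role for m in messages])
print([m.content for m in messages[1:]])
```

However many state messages are added, the first element returned by `get_messages()` is the system message, which mirrors the ordering guarantee described above.
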
## Conclusion

The System Prompt is the essential rulebook that governs the LLM's behavior within the `Browser Use` framework. It tells the LLM how to interpret the browser state, what actions it can take, and most importantly, dictates the exact JSON format for its responses. This structured communication is key to enabling the `Agent` to reliably understand the LLM's plan and execute browser automation tasks.

Without a clear System Prompt, the LLM would be like an untrained assistant: potentially intelligent, but unable to follow the specific procedures needed for the job.

Now that we understand how the `Agent` gets its fundamental instructions, how does it actually perceive the webpage it's supposed to interact with? In the next chapter, we'll explore the component responsible for representing the browser's state: the [BrowserContext](03_browsercontext.md).

[Next Chapter: BrowserContext](03_browsercontext.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Browser Use/03_browsercontext.md
================================================
---
layout: default
title: "BrowserContext"
parent: "Browser Use"
nav_order: 3
---

# Chapter 3: BrowserContext - The Agent's Isolated Workspace

In the [previous chapter](02_system_prompt.md), we learned how the `System Prompt` acts as the rulebook for the AI assistant (LLM) that guides our `Agent`. We know the Agent uses the LLM to decide *what* to do next based on the current situation in the browser.

But *where* does the Agent actually "see" the webpage and perform its actions? How does it keep track of the current website address (URL), the page content, and things like cookies, all while staying focused on its specific task without getting mixed up with your other browsing?

This is where the **BrowserContext** comes in.

## What Problem Does BrowserContext Solve?

Imagine you ask your `Agent` to log into a specific online shopping website and check your order status. You might already be logged into that same website in your regular browser window with your personal account.

If the Agent just used your main browser window, it might:

1. Get confused by your existing login.
2. Accidentally use your personal cookies or saved passwords.
3. Interfere with other tabs you have open.

We need a way to give the Agent its *own*, clean, separate browsing environment for each task. It needs an isolated "workspace" where it can open websites, log in, click buttons, and manage its own cookies without affecting anything else.

The `BrowserContext` solves this by representing a single, isolated browser session.

## Meet the BrowserContext: Your Agent's Private Browser Window

Think of a `BrowserContext` like opening a brand new **Incognito Window** or creating a **separate User Profile** in your web browser (like Chrome or Firefox).

* **It's Isolated:** What happens in one `BrowserContext` doesn't affect others or your main browser session. It has its own cookies, its own history (for that session), and its own set of tabs.
* **It Manages State:** It keeps track of everything important about the current web session the Agent is working on:
    * The current URL.
    * Which tabs are open within its "window".
    * Cookies specific to that session.
    * The structure and content of the current webpage (the DOM - Document Object Model, which we'll explore in the [next chapter](04_dom_representation.md)).
* **It's the Agent's Viewport:** The `Agent` looks through the `BrowserContext` to "see" the current state of the webpage. When the Agent decides to perform an action (like clicking a button), it tells the [Action Controller](05_action_controller___registry.md) to perform it *within* that specific `BrowserContext`.

Essentially, the `BrowserContext` is like a dedicated, clean desk or workspace given to the Agent for its specific job.

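
The isolation idea can be modeled in a few lines of plain Python. This is a toy sketch only: `ToyBrowser` and `ToyContext` are hypothetical classes invented for this example and do not drive a real browser or reflect how `browser_use` is implemented. The point is that each context owns its own URL and cookie store.

```python
class ToyContext:
    """Hypothetical isolated session: its own URL and its own cookie jar."""

    def __init__(self, context_id: int) -> None:
        self.context_id = context_id
        self.url = "about:blank"
        self.cookies: dict = {}  # private to this context, never shared

    def navigate_to(self, url: str) -> None:
        self.url = url

    def set_cookie(self, name: str, value: str) -> None:
        self.cookies[name] = value

    def get_state(self) -> dict:
        # Return a snapshot so callers cannot mutate the session by accident
        return {"url": self.url, "cookies": dict(self.cookies)}


class ToyBrowser:
    """Hypothetical browser application that hands out isolated contexts."""

    def __init__(self) -> None:
        self._next_id = 0

    def new_context(self) -> ToyContext:
        self._next_id += 1
        return ToyContext(context_id=self._next_id)


browser = ToyBrowser()
agent_session = browser.new_context()
personal_session = browser.new_context()

# The agent logs in inside its own session...
agent_session.navigate_to("https://shop.example/orders")
agent_session.set_cookie("session_token", "agent-123")

# ...and the other session is untouched.
print(agent_session.get_state())
print(personal_session.get_state())
```

Because every context gets a fresh cookie dictionary, a login performed in one session is invisible to the other, which is the property the incognito-window analogy describes.
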

## Using the BrowserContext

Before we can have an isolated session (`BrowserContext`), we first need the main browser application itself. This is handled by the `Browser` class. Think of `Browser` as the entire Chrome or Firefox application installed on your computer, while `BrowserContext` is just one window or profile within that application.

Here's a simplified example of how you might set up a `Browser` and then create a `BrowserContext` to navigate to a page:

```python
import asyncio
# Import necessary classes
from browser_use import Browser, BrowserConfig, BrowserContext, BrowserContextConfig

async def main():
    # 1. Configure the main browser application (optional, defaults are usually fine)
    browser_config = BrowserConfig(headless=False) # Show the browser window

    # 2. Create the main Browser instance
    # This might launch a browser application in the background (or connect to one)
    browser = Browser(config=browser_config)
    print("Browser application instance created.")

    # 3. Configure the specific session/window (optional)
    context_config = BrowserContextConfig(
        user_agent="MyCoolAgent/1.0", # Example: Set a custom user agent
        cookies_file="my_session_cookies.json" # Example: Save/load cookies
    )

    # 4. Create the isolated BrowserContext (like opening an incognito window)
    # We use 'async with' to ensure it cleans up automatically afterwards
    async with browser.new_context(config=context_config) as browser_context:
        print(f"BrowserContext created (ID: {browser_context.context_id}).")

        # 5. Use the context to interact with the browser session
        start_url = "https://example.com"
        print(f"Navigating to: {start_url}")
        await browser_context.navigate_to(start_url)

        # 6. Get information *from* the context
        current_state = await browser_context.get_state() # Get current page info
        print(f"Current page title: {current_state.title}")
        print(f"Current page URL: {current_state.url}")

        # The Agent would use this 'browser_context' object to see the page
        # and tell the Controller to perform actions within it.

    print("BrowserContext closed automatically.")

    # 7. Close the main browser application when done
    await browser.close()
    print("Browser application closed.")

# Run the asynchronous code
asyncio.run(main())
```

**What happens here?**

1.  We set up a `BrowserConfig` (telling it *not* to run headless so we can see the window).
2.  We create a `Browser` instance, which represents the overall browser program.
3.  We create a `BrowserContextConfig` to specify settings for our isolated session (like a custom user agent or where to save cookies).
4.  Crucially, `browser.new_context(...)` creates our isolated session. The `async with` block ensures this session is properly closed later.
5.  We use methods *on the `browser_context` object* like `navigate_to()` to control *this specific session*.
6.  We use `browser_context.get_state()` to get information about the current page within *this session*. The `Agent` heavily relies on this method.
7.  After the `async with` block finishes, the `browser_context` is closed (like closing the incognito window), and finally, we close the main `browser` application.

## How it Works Under the Hood

When the `Agent` needs to understand the current situation to decide the next step, it asks the `BrowserContext` for the latest state using the `get_state()` method. What happens then?

1.  **Wait for Stability:** The `BrowserContext` first waits for the webpage to finish loading and for network activity to settle down (`_wait_for_page_and_frames_load`). This prevents the Agent from acting on an incomplete page.
2.  **Analyze the Page:** It then uses the [DOM Representation](04_dom_representation.md) service (`DomService`) to analyze the current HTML structure of the page. This service figures out which elements are visible, interactive (buttons, links, input fields), and where they are.
3.  **Capture Visuals:** It often takes a screenshot of the current view (`take_screenshot`). This can be helpful for advanced agents or debugging.
4.  **Gather Metadata:** It gets the current URL, page title, and information about any other tabs open *within this context*.
5.  **Package the State:** All this information (DOM structure, URL, title, screenshot, etc.) is bundled into a `BrowserState` object.
6.  **Return to Agent:** The `BrowserContext` returns this `BrowserState` object to the `Agent`. The Agent then uses this information (often sending it to the LLM) to plan its next action.

Here's a simplified diagram of the `get_state()` process:

```mermaid
sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant PlaywrightPage as Underlying Browser Page
    participant DomService as DOM Service

    Agent->>BC: get_state()
    Note over BC: Wait for page to be ready...
    BC->>PlaywrightPage: Ensure page/network is stable
    PlaywrightPage-->>BC: Page is ready
    Note over BC: Analyze the page content...
    BC->>DomService: Get simplified DOM structure + interactive elements
    DomService-->>BC: DOMState (element tree, etc.)
    Note over BC: Get visuals and metadata...
    BC->>PlaywrightPage: Take screenshot()
    PlaywrightPage-->>BC: Screenshot data
    BC->>PlaywrightPage: Get URL, Title
    PlaywrightPage-->>BC: URL, Title data
    Note over BC: Combine everything...
    BC->>BC: Create BrowserState object
    BC-->>Agent: Return BrowserState
```

Let's look at some simplified code snippets from the library.

The `BrowserContext` is initialized (`__init__` in `browser/context.py`) with its configuration and a reference to the main `Browser` instance that created it.

```python
# --- File: browser/context.py (Simplified __init__) ---
import logging
import uuid
from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional
# ... other imports ...
if TYPE_CHECKING:
    from browser_use.browser.browser import Browser # Link to the Browser class

logger = logging.getLogger(__name__)

@dataclass
class BrowserContextConfig: # Configuration settings
    # ... various settings like user_agent, cookies_file, window_size ...
    pass

@dataclass
class BrowserSession: # Holds the actual Playwright context
    context: PlaywrightBrowserContext # The underlying Playwright object
    cached_state: Optional[BrowserState] = None # Stores the last known state

class BrowserContext:
    def __init__(
        self,
        browser: 'Browser', # Reference to the main Browser instance
        config: BrowserContextConfig = BrowserContextConfig(),
        # ... other optional state ...
    ):
        self.context_id = str(uuid.uuid4()) # Unique ID for this session
        self.config = config # Store the configuration
        self.browser = browser # Store the reference to the parent Browser

        # The actual Playwright session is created later, when needed
        self.session: BrowserSession | None = None
        logger.debug(f"BrowserContext object created (ID: {self.context_id}). Session not yet initialized.")

    # The 'async with' statement calls __aenter__ which initializes the session
    async def __aenter__(self):
        await self._initialize_session() # Creates the actual browser window/tab
        return self

    async def _initialize_session(self):
        # ... (complex setup code happens here) ...
        # Gets the main Playwright browser from self.browser
        playwright_browser = await self.browser.get_playwright_browser()
        # Creates the isolated Playwright context (like the incognito window)
        context = await self._create_context(playwright_browser)

        # Creates the BrowserSession to hold the context and state
        self.session = BrowserSession(context=context, cached_state=None)
        logger.debug(f"BrowserContext session initialized (ID: {self.context_id}).")
        # ... (sets up the initial page) ...
        return self.session

    # ... other methods like navigate_to, close, etc. ...
```

The `get_state` method orchestrates fetching the current information from the browser session.

```python
# --- File: browser/context.py (Simplified get_state and helpers) ---
import asyncio
import base64
# ... other imports ...
from browser_use.dom.service import DomService # Imports the DOM analyzer
from browser_use.browser.views import BrowserState # Imports the state structure

class BrowserContext:
    # ... (init, aenter, etc.) ...

    async def get_state(self) -> BrowserState:
        """Get the current state of the browser session."""
        logger.debug(f"Getting state for context {self.context_id}...")
        # 1. Make sure the page is loaded and stable
        await self._wait_for_page_and_frames_load()

        # 2. Get the actual Playwright session object
        session = await self.get_session()

        # 3. Update the state (this does the heavy lifting)
        session.cached_state = await self._update_state()
        logger.debug(f"State update complete for {self.context_id}.")

        # 4. Optionally save cookies if configured
        if self.config.cookies_file:
            asyncio.create_task(self.save_cookies())

        return session.cached_state

    async def _wait_for_page_and_frames_load(self, timeout_overwrite: float | None = None):
        """Ensures page is fully loaded before continuing."""
        # ... (complex logic to wait for network idle, minimum times) ...
        page = await self.get_current_page()
        await page.wait_for_load_state('load', timeout=5000) # Simplified wait
        logger.debug("Page load/network stability checks passed.")
        await asyncio.sleep(self.config.minimum_wait_page_load_time) # Ensure minimum wait

    async def _update_state(self) -> BrowserState:
        """Fetches all info and builds the BrowserState."""
        session = await self.get_session()
        page = await self.get_current_page() # Get the active Playwright page object

        try:
            # Use DomService to analyze the page content
            dom_service = DomService(page)
            # Get the simplified DOM tree and interactive elements map
            content_info = await dom_service.get_clickable_elements(
                highlight_elements=self.config.highlight_elements,
                # ... other DOM options ...
            )

            # Take a screenshot
            screenshot_b64 = await self.take_screenshot()

            # Get URL, Title, Tabs, Scroll info etc.
            url = page.url
            title = await page.title()
            tabs = await self.get_tabs_info()
            pixels_above, pixels_below = await self.get_scroll_info(page)

            # Create the BrowserState object
            browser_state = BrowserState(
                element_tree=content_info.element_tree,
                selector_map=content_info.selector_map,
                url=url,
                title=title,
                tabs=tabs,
                screenshot=screenshot_b64,
                pixels_above=pixels_above,
                pixels_below=pixels_below,
            )
            return browser_state

        except Exception as e:
            logger.error(f'Failed to update state: {str(e)}')
            # Maybe return old state or raise error
            raise BrowserError("Failed to get browser state") from e

    async def take_screenshot(self, full_page: bool = False) -> str:
        """Takes a screenshot and returns base64 encoded string."""
        page = await self.get_current_page()
        screenshot_bytes = await page.screenshot(full_page=full_page, animations='disabled')
        return base64.b64encode(screenshot_bytes).decode('utf-8')

    # ... many other helper methods (get_current_page, get_tabs_info, etc.) ...
```

This shows how `BrowserContext` acts as a manager for a specific browser session, using underlying tools (like Playwright and `DomService`) to gather the necessary information (`BrowserState`) that the `Agent` needs to operate.

## Conclusion

The `BrowserContext` is a fundamental concept in `Browser Use`. It provides the necessary **isolated environment** for the `Agent` to perform its tasks, much like an incognito window or a separate browser profile. It manages the session's state (URL, cookies, tabs, page content) and provides the `Agent` with a snapshot of the current situation via the `get_state()` method.

Understanding the `BrowserContext` helps clarify *where* the Agent works. Now, how does the Agent actually understand the *content* of the webpage within that context? How is the complex structure of a webpage represented in a way the Agent (and the LLM) can understand?
In the next chapter, we'll dive into exactly that: the [DOM Representation](04_dom_representation.md).

[Next Chapter: DOM Representation](04_dom_representation.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Browser Use/04_dom_representation.md
================================================
---
layout: default
title: "DOM Representation"
parent: "Browser Use"
nav_order: 4
---

# Chapter 4: DOM Representation - Mapping the Webpage

In the [previous chapter](03_browsercontext.md), we learned about the `BrowserContext`, the Agent's private workspace for browsing. We saw that the Agent uses `browser_context.get_state()` to get a snapshot of the current webpage. But how does the Agent actually *understand* the content of that snapshot?

Imagine you're looking at the Google homepage. You instantly recognize the logo, the search bar, and the buttons. But a computer program just sees a wall of code (HTML). How can our `Agent` figure out: "This rectangular box is the search bar I need to type into," or "This specific image link is the first result I should click"?

This is the problem solved by **DOM Representation**.

## What Problem Does DOM Representation Solve?

Webpages are built using HTML (HyperText Markup Language), which describes the structure and content. Your browser reads this HTML and creates an internal, structured representation called the **Document Object Model (DOM)**. It's like the browser builds a detailed blueprint or an outline from the HTML instructions.

However, this raw DOM blueprint is incredibly complex and contains lots of information irrelevant to our Agent's task. The Agent doesn't need to know about every single tiny visual detail; it needs a *simplified map* focused on what's important for interaction:

1.  **What elements are on the page?** (buttons, links, input fields, text)
2.  **Are they visible to a user?** (Hidden elements shouldn't be interacted with)
3.  **Are they interactive?** (Can you click it? Can you type in it?)
4.  **How can the Agent refer to them?** (We need a simple way to say "click *this* button")

DOM Representation solves the problem of translating the complex, raw DOM blueprint into a simplified, structured map that highlights the interactive "landmarks" and pathways the Agent can use.

## Meet `DomService`: The Map Maker

The component responsible for creating this map is the `DomService`. Think of it as a cartographer specializing in webpages.

When the `Agent` (via the `BrowserContext`) asks for the current state of the page, the `BrowserContext` employs the `DomService` to analyze the page's live DOM.

Here's what the `DomService` does:

1.  **Examines the Live Page:** It looks at the current structure rendered in the browser tab, not just the initial HTML source code (because JavaScript can change the page after it loads).
2.  **Identifies Elements:** It finds all the meaningful elements like buttons, links, input fields, and text blocks.
3.  **Checks Properties:** For each element, it determines crucial properties:
    *   **Visibility:** Is it actually displayed on the screen?
    *   **Interactivity:** Is it something a user can click, type into, or otherwise interact with?
    *   **Position:** Where is it located (roughly)?
4.  **Assigns Interaction Indices:** This is key! For elements deemed interactive and visible, `DomService` assigns a unique number, called a `highlight_index` (like `[5]`, `[12]`, etc.). This gives the Agent and the LLM a simple, unambiguous way to refer to specific elements.
5.  **Builds a Structured Tree:** It organizes this information into a simplified tree structure (`element_tree`) that reflects the page layout but is much easier to process than the full DOM.
6.  **Creates an Index Map:** It generates a `selector_map`, which is like an index in a book, mapping each `highlight_index` directly to its corresponding element node in the tree.

The final output is a `DOMState` object containing the simplified `element_tree` and the handy `selector_map`. This `DOMState` is then included in the `BrowserState` that `BrowserContext.get_state()` returns to the Agent.

## The Output: `DOMState` - The Agent's Map

The `DOMState` object produced by `DomService` has two main parts:

1.  **`element_tree`:** This is the root of our simplified map, represented as a `DOMElementNode` object (defined in `dom/views.py`). Each node in the tree can be either an element (`DOMElementNode`) or a piece of text (`DOMTextNode`). `DOMElementNode`s contain information like the tag name (`

```
\n[7]Images"

// And respond with:
{
  "current_state": {
    "evaluation_previous_goal": "...",
    "memory": "On Google homepage, need to search for cats.",
    "next_goal": "Type 'cute cats' into the search bar [5]."
  },
  "action": [
    {
      "input_text": {
        "index": 5, // <-- Uses the highlight_index from the DOM map!
        "text": "cute cats"
      }
    }
    // ... maybe press Enter action ...
  ]
}
```

## Code Example: Seeing the Map

We don't usually interact with `DomService` directly. Instead, we get its output via the `BrowserContext`.
Let's revisit the example from Chapter 3 and see where the DOM representation fits:

```python
import asyncio
from browser_use import Browser, BrowserConfig, BrowserContext, BrowserContextConfig

async def main():
    browser_config = BrowserConfig(headless=False)
    browser = Browser(config=browser_config)
    context_config = BrowserContextConfig()

    async with browser.new_context(config=context_config) as browser_context:
        # Navigate to a page (e.g., Google)
        await browser_context.navigate_to("https://www.google.com")

        print("Getting current page state...")
        # This call uses DomService internally to generate the DOM representation
        current_state = await browser_context.get_state()

        print(f"\nCurrent Page URL: {current_state.url}")
        print(f"Current Page Title: {current_state.title}")

        # Accessing the DOM Representation parts within the BrowserState
        print("\n--- DOM Representation Details ---")

        # The element_tree is the root node of our simplified DOM map
        if current_state.element_tree:
            print(f"Root element tag of simplified tree: <{current_state.element_tree.tag_name}>")
        else:
            print("Element tree is empty.")

        # The selector_map provides direct access to interactive elements by index
        if current_state.selector_map:
            print(f"Number of interactive elements found: {len(current_state.selector_map)}")

            # Let's try to find the element the LLM might call [5] (often the search bar)
            example_index = 5 # Note: Indices can change depending on the page!
            if example_index in current_state.selector_map:
                element_node = current_state.selector_map[example_index]
                print(f"Element [{example_index}]: Tag=<{element_node.tag_name}>, Attributes={element_node.attributes}")
                # The Agent uses this node reference to perform actions
            else:
                print(f"Element [{example_index}] not found in the selector map for this page state.")
        else:
            print("No interactive elements found (selector map is empty).")

        # The Agent would typically convert element_tree into a compact text format
        # (using methods like element_tree.clickable_elements_to_string())
        # to send to the LLM along with the task instructions.

    print("\nBrowserContext closed.")
    await browser.close()
    print("Browser closed.")

# Run the asynchronous code
asyncio.run(main())
```

**What happens here?**

1.  We set up the `Browser` and `BrowserContext`.
2.  We navigate to Google.
3.  `browser_context.get_state()` is called. **Internally**, this triggers the `DomService`.
4.  `DomService` analyzes the Google page, finds interactive elements (like the search bar, buttons), assigns them `highlight_index` numbers, and builds the `element_tree` and `selector_map`.
5.  This `DOMState` (containing the tree and map) is packaged into the `BrowserState` object returned by `get_state()`.
6.  Our code then accesses `current_state.element_tree` and `current_state.selector_map` to peek at the map created by `DomService`.
7.  We demonstrate looking up an element using its potential index (`selector_map[5]`).

## How It Works Under the Hood: `DomService` in Action

Let's trace the flow when `BrowserContext.get_state()` is called:

```mermaid
sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant DomService
    participant PlaywrightPage as Browser Page (JS Env)
    participant buildDomTree_js as buildDomTree.js

    Agent->>BC: get_state()
    Note over BC: Needs to analyze the page content
    BC->>DomService: get_clickable_elements(...)
    Note over DomService: Needs to run analysis script in browser
    DomService->>PlaywrightPage: evaluate(js_code='buildDomTree.js', args={...})
    Note over PlaywrightPage: Execute JavaScript code
    PlaywrightPage->>buildDomTree_js: Run analysis function
    Note over buildDomTree_js: Analyzes live DOM, finds visible & interactive elements, assigns highlight_index
    buildDomTree_js-->>PlaywrightPage: Return structured data (nodes, indices, map)
    PlaywrightPage-->>DomService: Return JS execution result (JSON-like data)
    Note over DomService: Process the raw data from JS
    DomService->>DomService: _construct_dom_tree(result)
    Note over DomService: Builds Python DOMElementNode tree and selector_map
    DomService-->>BC: Return DOMState (element_tree, selector_map)
    Note over BC: Combine DOMState with URL, title, screenshot etc.
    BC->>BC: Create BrowserState object
    BC-->>Agent: Return BrowserState (containing DOM map)
```

**Key Code Points:**

1.  **`BrowserContext` calls `DomService`:** Inside `browser/context.py`, the `_update_state` method (called by `get_state`) initializes and uses the `DomService`:

    ```python
    # --- File: browser/context.py (Simplified _update_state) ---
    from browser_use.dom.service import DomService # Import the service
    from browser_use.browser.views import BrowserState

    class BrowserContext:
        # ... other methods ...
        async def _update_state(self) -> BrowserState:
            page = await self.get_current_page() # Get the active Playwright page object
            # ... error handling ...
            try:
                # 1. Create DomService instance for the current page
                dom_service = DomService(page)

                # 2. Call DomService to get the DOM map (DOMState)
                content_info = await dom_service.get_clickable_elements(
                    highlight_elements=self.config.highlight_elements,
                    viewport_expansion=self.config.viewport_expansion,
                    # ... other options ...
                )

                # 3. Get other info (screenshot, URL, title etc.)
                screenshot_b64 = await self.take_screenshot()
                url = page.url
                title = await page.title()
                # ... gather more state ...

                # 4. Package everything into BrowserState
                browser_state = BrowserState(
                    element_tree=content_info.element_tree, # <--- From DomService
                    selector_map=content_info.selector_map, # <--- From DomService
                    url=url,
                    title=title,
                    screenshot=screenshot_b64,
                    # ... other state info ...
                )
                return browser_state
            except Exception as e:
                logger.error(f'Failed to update state: {str(e)}')
                raise # Or handle error
    ```

2.  **`DomService` runs JavaScript:** Inside `dom/service.py`, the `_build_dom_tree` method executes the JavaScript code stored in `buildDomTree.js` within the browser page's context.

    ```python
    # --- File: dom/service.py (Simplified _build_dom_tree) ---
    import logging
    from importlib import resources
    # ... other imports ...

    logger = logging.getLogger(__name__)

    class DomService:
        def __init__(self, page: 'Page'):
            self.page = page
            # Load the JavaScript code from the file when DomService is created
            self.js_code = resources.read_text('browser_use.dom', 'buildDomTree.js')
            # ...

        async def _build_dom_tree(
            self, highlight_elements: bool, focus_element: int, viewport_expansion: int
        ) -> tuple[DOMElementNode, SelectorMap]:

            # Prepare arguments for the JavaScript function
            args = {
                'doHighlightElements': highlight_elements,
                'focusHighlightIndex': focus_element,
                'viewportExpansion': viewport_expansion,
                'debugMode': logger.getEffectiveLevel() == logging.DEBUG,
            }

            try:
                # Execute the JavaScript code in the browser page!
                # The JS code analyzes the live DOM and returns a structured result.
                eval_page = await self.page.evaluate(self.js_code, args)
            except Exception as e:
                logger.error('Error evaluating JavaScript: %s', e)
                raise

            # ... (optional debug logging) ...

            # Parse the result from JavaScript into Python objects
            return await self._construct_dom_tree(eval_page)

        async def _construct_dom_tree(self, eval_page: dict) -> tuple[DOMElementNode, SelectorMap]:
            # ... (logic to parse js_node_map from eval_page) ...
            # ... (loops through nodes, creates DOMElementNode/DOMTextNode objects) ...
            # ... (builds the tree structure by linking parents/children) ...
            # ... (populates the selector_map dictionary) ...
            # This uses the structures defined in dom/views.py
            # ...
            root_node = ... # Parsed root DOMElementNode
            selector_map = ... # Populated dictionary {index: DOMElementNode}
            return root_node, selector_map

        # ... other methods like get_clickable_elements ...
    ```

3.  **`buildDomTree.js` (Conceptual):** This JavaScript file (located at `dom/buildDomTree.js` in the library) is the core map-making logic that runs *inside the browser*. It traverses the live DOM, checks element visibility and interactivity using browser APIs (like `element.getBoundingClientRect()`, `window.getComputedStyle()`, `document.elementFromPoint()`), assigns the `highlight_index`, and packages the results into a structured format that the Python `DomService` can understand. *We don't need to understand the JS code itself, just its purpose.*

4.  **Python Data Structures (`DOMElementNode`, `DOMTextNode`):** The results from the JavaScript are parsed into Python objects defined in `dom/views.py`. These dataclasses (`DOMElementNode`, `DOMTextNode`) hold the information about each mapped element or text segment.

## Conclusion

DOM Representation, primarily handled by the `DomService`, is crucial for bridging the gap between the complex reality of a webpage (the DOM) and the Agent/LLM's need for a simplified, actionable understanding. By creating a structured `element_tree` and an indexed `selector_map`, it provides a clear map of interactive landmarks on the page, identified by simple `highlight_index` numbers.

This map allows the LLM to make specific plans like "type into element [5]" or "click element [12]", which the Agent can then reliably translate into concrete actions.

Now that we understand how the Agent sees the page, how does it actually *perform* those actions like clicking or typing?
In the next chapter, we'll explore the component responsible for executing the LLM's plan: the [Action Controller & Registry](05_action_controller___registry.md).

[Next Chapter: Action Controller & Registry](05_action_controller___registry.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Browser Use/05_action_controller___registry.md
================================================
---
layout: default
title: "Action Controller & Registry"
parent: "Browser Use"
nav_order: 5
---

# Chapter 5: Action Controller & Registry - The Agent's Hands and Toolbox

In the [previous chapter](04_dom_representation.md), we saw how the `DomService` creates a simplified map (`DOMState`) of the webpage, allowing the Agent and its LLM planner to identify interactive elements like buttons and input fields using unique numbers (`highlight_index`). The LLM uses this map to decide *what* specific action to take next, like "click element [5]" or "type 'hello world' into element [12]".

But how does the program actually *do* that? How does the abstract idea "click element [5]" turn into a real click inside the browser window managed by the [BrowserContext](03_browsercontext.md)?

This is where the **Action Controller** and **Action Registry** come into play. They are the "hands" and "toolbox" that execute the Agent's decisions.

## What Problem Do They Solve?

Imagine you have a detailed instruction manual (the LLM's plan) for building a model car. The manual tells you exactly which piece to pick up (`index=5`) and what to do with it ("click" or "attach"). However, you still need:

1.  **A Toolbox:** A collection of all the tools you might need (screwdriver, glue, pliers). You need to know what tools are available.
2.  **A Mechanic:** Someone (or you!) who can read the instruction ("Use the screwdriver on screw #5"), select the correct tool from the toolbox, and skillfully use it on the specified part.

Without the toolbox and the mechanic, the instruction manual is useless.

Similarly, the `Browser Use` Agent needs:

1.  **Action Registry (The Toolbox):** A defined list of all possible actions the Agent can perform (e.g., `click_element`, `input_text`, `scroll_down`, `go_to_url`, `done`). This registry also holds details about each action, like what parameters it needs (e.g., `click_element` needs an `index`).
2.  **Action Controller (The Mechanic):** A component that takes the specific action requested by the LLM (e.g., "execute `click_element` with `index=5`"), finds the corresponding function (the "tool") in the Registry, ensures the request is valid, and then executes that function using the [BrowserContext](03_browsercontext.md) (the "car").

The Controller and Registry solve the problem of translating the LLM's high-level plan into concrete, executable browser operations in a structured and reliable way.

## Meet the Toolbox and the Mechanic

Let's break down these two closely related concepts:

### 1. Action Registry: The Toolbox (`controller/registry/service.py`)

Think of the `Registry` as a carefully organized toolbox. Each drawer is labeled with the name of a tool (an action like `click_element`), and inside, you find the tool itself (the actual code function) along with its instructions (description and required parameters).

*   **Catalog of Actions:** It holds a dictionary where keys are action names (strings like `"click_element"`) and values are `RegisteredAction` objects containing:
    *   The action's `name`.
    *   A `description` (for humans and the LLM).
    *   The actual Python `function` to call.
    *   A `param_model` (a Pydantic model defining required parameters like `index` or `text`).
*   **Informs the LLM:** The `Registry` can generate a description of all available actions and their parameters. This description is given to the LLM (as part of the [System Prompt](02_system_prompt.md)) so it knows exactly what "tools" it's allowed to ask the Agent to use.

### 2. Action Controller: The Mechanic (`controller/service.py`)

The `Controller` is the skilled mechanic who uses the tools from the Registry.

*   **Receives Instructions:** It gets the action request from the Agent. This request typically comes in the form of an `ActionModel` object, which represents the LLM's JSON output (e.g., `{"click_element": {"index": 5}}`).
*   **Selects the Tool:** It looks at the `ActionModel`, identifies the action name (`"click_element"`), and retrieves the corresponding `RegisteredAction` from the `Registry`.
*   **Validates Parameters:** It uses the action's `param_model` (e.g., `ClickElementAction`) to check if the provided parameters (`{"index": 5}`) are correct.
*   **Executes the Action:** It calls the actual Python function associated with the action (e.g., the `click_element` function), passing it the validated parameters and the necessary `BrowserContext` (so the function knows *which* browser tab to act upon).
*   **Reports the Result:** The action function performs the task (e.g., clicking the element) and returns an `ActionResult` object, indicating whether it succeeded, failed, or produced some output. The Controller passes this result back to the Agent.

## Using the Controller: Executing an Action

In the Agent's main loop ([Chapter 1: Agent](01_agent.md)), after the LLM provides its plan as an `ActionModel`, the Agent simply hands this model over to the `Controller` to execute it.

```python
# --- Simplified Agent step calling the Controller ---

# Assume 'llm_response_model' is the ActionModel object parsed from LLM's JSON
# Assume 'self.controller' is the Controller instance
# Assume 'self.browser_context' is the current BrowserContext

# ... inside the Agent's step method ...

try:
    # Agent tells the Controller: "Execute this action!"
    action_result: ActionResult = await self.controller.act(
        action=llm_response_model, # The LLM's chosen action and parameters
        browser_context=self.browser_context # The browser tab to act within
        # Other context like LLMs for extraction might be passed too
    )

    # Agent receives the result from the Controller
    print(f"Action executed. Result: {action_result.extracted_content}")
    if action_result.is_done:
        print("Task marked as done by the action!")
    if action_result.error:
        print(f"Action encountered an error: {action_result.error}")

    # Agent records this result in the history ([Message Manager](06_message_manager.md))
    # ...

except Exception as e:
    print(f"Failed to execute action: {e}")
    # Handle the error
```

**What happens here?**

1.  The Agent has received `llm_response_model` (e.g., representing `{"click_element": {"index": 5}}`).
2.  It calls `self.controller.act()`, passing the action model and the active `browser_context`.
3.  The `controller.act()` method handles looking up the `"click_element"` function in the `Registry`, validating the `index` parameter, and calling the function to perform the click within the `browser_context`.
4.  The `click_element` function executes (interacting with the browser via `BrowserContext` methods).
5.  It returns an `ActionResult` (e.g., `ActionResult(extracted_content="Clicked button with index 5")`).
6.  The Agent receives this `action_result` and proceeds.
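
The lookup, validate, execute sequence described above can be sketched in a few lines of standalone Python. This is a toy model under stated assumptions, not the library's code: `ToyRegistry`, `ToyController`, `ClickParams`, and the plain-dict action format are hypothetical, and ordinary dataclasses stand in for the Pydantic parameter models, so the "validation" only checks parameter names.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


@dataclass
class ActionResult:
    """Toy result object; field names mirror the ones used in the text."""
    extracted_content: Optional[str] = None
    error: Optional[str] = None
    is_done: bool = False


@dataclass
class ClickParams:
    """Hypothetical parameter model for the click action."""
    index: int


@dataclass
class RegisteredAction:
    name: str
    description: str
    function: Callable[..., Awaitable[ActionResult]]
    param_model: type


class ToyRegistry:
    """The toolbox: maps action names to their function and parameter model."""

    def __init__(self) -> None:
        self.actions: dict = {}

    def action(self, description: str, param_model: type):
        def decorator(func):
            self.actions[func.__name__] = RegisteredAction(
                func.__name__, description, func, param_model
            )
            return func
        return decorator


class ToyController:
    """The mechanic: looks up the tool, checks the parameters, runs it."""

    def __init__(self, registry: ToyRegistry) -> None:
        self.registry = registry

    async def act(self, action: dict, browser_context: Any) -> ActionResult:
        # The action dict has a single key: the action name
        name, raw_params = next(iter(action.items()))
        registered = self.registry.actions.get(name)
        if registered is None:
            return ActionResult(error=f"Unknown action: {name}")
        try:
            # Crude validation: unknown or missing parameter names raise TypeError
            params = registered.param_model(**raw_params)
        except TypeError as exc:
            return ActionResult(error=f"Invalid parameters for {name}: {exc}")
        return await registered.function(params, browser_context)


registry = ToyRegistry()


@registry.action("Click the element with the given index", param_model=ClickParams)
async def click_element(params: ClickParams, browser_context: Any) -> ActionResult:
    # A real action would drive the browser here; the toy version only reports
    return ActionResult(extracted_content=f"Clicked element with index {params.index}")


controller = ToyController(registry)
result = asyncio.run(controller.act({"click_element": {"index": 5}}, browser_context=None))
print(result.extracted_content)  # Clicked element with index 5

bad = asyncio.run(controller.act({"fly_away": {}}, browser_context=None))
print(bad.error)  # Unknown action: fly_away
```

Returning an `ActionResult` with an `error` instead of raising keeps the calling loop simple: the caller can record the failure and let the planner try something else.
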
## How it Works Under the Hood: The Execution Flow

Let's trace the journey of an action request from the Agent to the browser click:

```mermaid
sequenceDiagram
    participant Agent
    participant Controller
    participant Registry
    participant ClickFunc as click_element Function
    participant BC as BrowserContext

    Note over Agent: LLM decided: click_element(index=5)
    Agent->>Controller: act(action={"click_element": {"index": 5}}, browser_context=BC)
    Note over Controller: Identify action and params
    Controller->>Controller: action_name = "click_element", params = {"index": 5}
    Note over Controller: Ask Registry for the tool
    Controller->>Registry: Get action definition for "click_element"
    Registry-->>Controller: Return RegisteredAction(name="click_element", function=ClickFunc, param_model=ClickElementAction, ...)
    Note over Controller: Validate params using param_model
    Controller->>Controller: ClickElementAction(index=5) # Validation OK
    Note over Controller: Execute the function
    Controller->>ClickFunc: ClickFunc(params=ClickElementAction(index=5), browser=BC)
    Note over ClickFunc: Perform the click via BrowserContext
    ClickFunc->>BC: Find element with index 5
    BC-->>ClickFunc: Element reference
    ClickFunc->>BC: Execute click on element
    BC-->>ClickFunc: Click successful
    ClickFunc-->>Controller: Return ActionResult(extracted_content="Clicked button...")
    Controller-->>Agent: Return ActionResult
```

This diagram shows the Controller orchestrating the process: receiving the request, consulting the Registry, validating, calling the specific action function, and returning the result.

## Diving Deeper into the Code

Let's peek at simplified versions of the key files.

### 1. Registering Actions (`controller/registry/service.py`)

Actions are typically registered using a decorator `@registry.action`.
```python
# --- File: controller/registry/service.py (Simplified Registry) ---
from typing import Any, Callable, Type

from pydantic import BaseModel

# Assume ActionModel, RegisteredAction are defined in views.py

class Registry:
    def __init__(self, exclude_actions: list[str] | None = None):
        self.registry: dict[str, RegisteredAction] = {}
        # Copy into a fresh list so instances never share a mutable default
        self.exclude_actions = list(exclude_actions or [])
        # ... other initializations ...

    def _create_param_model(self, function: Callable) -> Type[BaseModel]:
        """Creates a Pydantic model from function signature (simplified)"""
        # ... (Inspects function signature to build a model) ...
        # Example: for func(index: int, text: str), creates a model
        # class func_parameters(ActionModel):
        #     index: int
        #     text: str
        # return func_parameters
        pass # Placeholder for complex logic

    def action(
        self,
        description: str,
        param_model: Type[BaseModel] | None = None,
    ):
        """Decorator for registering actions"""
        def decorator(func: Callable):
            if func.__name__ in self.exclude_actions:
                return func # Skip excluded

            # If no specific param_model provided, try to generate one
            actual_param_model = param_model # Or self._create_param_model(func) if needed

            # Ensure function is awaitable (async)
            wrapped_func = func # Assume func is already async for simplicity

            action = RegisteredAction(
                name=func.__name__,
                description=description,
                function=wrapped_func,
                param_model=actual_param_model,
            )
            self.registry[func.__name__] = action # Add to the toolbox!
            print(f"Action '{func.__name__}' registered.")
            return func
        return decorator

    def get_prompt_description(self) -> str:
        """Get a description of all actions for the prompt (simplified)"""
        descriptions = []
        for action in self.registry.values():
            # Format description for LLM (e.g., "click_element: Click element {index: {'type': 'integer'}}")
            descriptions.append(f"{action.name}: {action.description} {action.param_model.schema()}")
        return "\n".join(descriptions)

    async def execute_action(self, action_name: str, params: dict, browser, **kwargs) -> Any:
        """Execute a registered action (simplified)"""
        if action_name not in self.registry:
            raise ValueError(f"Action {action_name} not found")

        action = self.registry[action_name]
        try:
            # Validate params using the registered Pydantic model
            validated_params = action.param_model(**params)

            # Call the actual action function with validated params and browser context
            # Assumes function takes validated_params model and browser
            result = await action.function(validated_params, browser=browser, **kwargs)
            return result
        except Exception as e:
            raise RuntimeError(f"Error executing {action_name}: {e}") from e
```

This shows how the `@registry.action` decorator takes a function, its description, and parameter model, and stores them in the `registry` dictionary. `execute_action` is the core method used by the `Controller` to run a specific action: it looks up the action by name, validates the raw parameters against the action's `param_model`, and awaits the registered function.

### 2. Defining Action Parameters (`controller/views.py`)

Each action often has its own Pydantic model to define its expected parameters.
```python
# --- File: controller/views.py (Simplified Action Parameter Models) ---
from pydantic import BaseModel
from typing import Optional

# Example parameter model for the 'click_element' action
class ClickElementAction(BaseModel):
    index: int                   # The highlight_index of the element to click
    xpath: Optional[str] = None  # Optional hint (usually index is enough)

# Example parameter model for the 'input_text' action
class InputTextAction(BaseModel):
    index: int                   # The highlight_index of the input field
    text: str                    # The text to type
    xpath: Optional[str] = None  # Optional hint

# Example parameter model for the 'done' action (task completion)
class DoneAction(BaseModel):
    text: str                    # A final message or result
    success: bool                # Was the overall task successful?

# ... other action models like GoToUrlAction, ScrollAction etc. ...
```

These models ensure that when the Controller receives parameters like `{"index": 5}`, it can validate that `index` is indeed an integer as required by `ClickElementAction`.

### 3. The Controller Service (`controller/service.py`)

The `Controller` class ties everything together. It initializes the `Registry` and registers the default browser actions. Its main job is the `act` method.
```python
# --- File: controller/service.py (Simplified Controller) ---
import logging

from browser_use.agent.views import ActionModel, ActionResult # Input/Output types
from browser_use.browser.context import BrowserContext # Needed by actions
from browser_use.controller.registry.service import Registry # The toolbox
from browser_use.controller.views import ClickElementAction, InputTextAction, DoneAction # Param models

logger = logging.getLogger(__name__)

class Controller:
    def __init__(self, exclude_actions: list[str] | None = None):
        # Avoid a shared mutable default; always hand the Registry a list
        self.registry = Registry(exclude_actions=exclude_actions or []) # Initialize the toolbox

        # --- Register Default Actions ---
        # (Registration happens when Controller is created)

        @self.registry.action("Click element", param_model=ClickElementAction)
        async def click_element(params: ClickElementAction, browser: BrowserContext):
            logger.info(f"Attempting to click element index {params.index}")
            # --- Actual click logic using browser object ---
            element_node = await browser.get_dom_element_by_index(params.index)
            await browser._click_element_node(element_node) # Internal browser method
            # ---
            msg = f"🖱️ Clicked element with index {params.index}"
            return ActionResult(extracted_content=msg, include_in_memory=True)

        @self.registry.action("Input text into an element", param_model=InputTextAction)
        async def input_text(params: InputTextAction, browser: BrowserContext):
            logger.info(f"Attempting to type into element index {params.index}")
            # --- Actual typing logic using browser object ---
            element_node = await browser.get_dom_element_by_index(params.index)
            await browser._input_text_element_node(element_node, params.text) # Internal method
            # ---
            msg = f"⌨️ Input text into index {params.index}"
            return ActionResult(extracted_content=msg, include_in_memory=True)

        @self.registry.action("Complete task", param_model=DoneAction)
        async def done(params: DoneAction, browser: BrowserContext):
            # 'browser' is unused here; it is accepted because the simplified
            # execute_action above always passes browser=... to the function
            logger.info(f"Task completion requested. Success: {params.success}")
            return ActionResult(is_done=True, success=params.success, extracted_content=params.text)

        # ... registration for scroll_down, go_to_url, etc. ...

    async def act(
        self,
        action: ActionModel, # The ActionModel from the LLM
        browser_context: BrowserContext, # The context to act within
        **kwargs # Other potential context (LLMs, etc.)
    ) -> ActionResult:
        """Execute an action defined in the ActionModel"""
        try:
            # ActionModel might look like: ActionModel(click_element=ClickElementAction(index=5))
            # model_dump gets {'click_element': {'index': 5}}
            action_data = action.model_dump(exclude_unset=True)

            for action_name, params in action_data.items():
                if params is not None:
                    logger.debug(f"Executing action: {action_name} with params: {params}")
                    # Call the registry's execute method
                    result = await self.registry.execute_action(
                        action_name=action_name,
                        params=params,
                        browser=browser_context, # Pass the essential context
                        **kwargs # Pass any other context needed by actions
                    )

                    # Ensure result is ActionResult or convert it
                    if isinstance(result, ActionResult):
                        return result
                    if isinstance(result, str):
                        return ActionResult(extracted_content=result)
                    return ActionResult() # Default empty result if action returned None

            logger.warning("ActionModel had no action to execute.")
            return ActionResult(error="No action specified in the model")

        except Exception as e:
            logger.error(f"Error during controller.act: {e}", exc_info=True)
            return ActionResult(error=str(e)) # Return error in ActionResult
```

The `Controller` registers all the standard browser actions during initialization. The `act` method then dynamically finds and executes the requested action using the `Registry`.

## Conclusion

The **Action Registry** acts as the definitive catalog or "toolbox" of all operations the `Browser Use` Agent can perform.
The **Action Controller** is the "mechanic" that interprets the LLM's plan, selects the appropriate tool from the Registry, and executes it within the specified [BrowserContext](03_browsercontext.md).

Together, they provide a robust and extensible way to translate high-level instructions into low-level browser interactions, forming the crucial link between the Agent's "brain" (LLM planner) and its "hands" (browser manipulation).

Now that we know how actions are chosen and executed, how does the Agent keep track of the conversation with the LLM, including the history of states observed and actions taken? We'll explore this in the next chapter on the [Message Manager](06_message_manager.md).

[Next Chapter: Message Manager](06_message_manager.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Browser Use/06_message_manager.md
================================================
---
layout: default
title: "Message Manager"
parent: "Browser Use"
nav_order: 6
---

# Chapter 6: Message Manager - Keeping the Conversation Straight

In the [previous chapter](05_action_controller___registry.md), we learned how the `Action Controller` and `Registry` act as the Agent's "hands" and "toolbox", executing the specific actions decided by the LLM planner. But how does the LLM get all the information it needs to make those decisions in the first place? How does the Agent keep track of the ongoing conversation, including what it "saw" on the page and what happened after each action?

Imagine you're having a long, multi-step discussion with an assistant about a complex task. If the assistant has a poor memory, they might forget earlier instructions, the current status, or previous results, making it impossible to proceed correctly. LLMs face a similar challenge: they need the conversation history for context, but they have a limited memory (called the "context window").
This is the problem the **Message Manager** solves.

## What Problem Does the Message Manager Solve?

The `Agent` needs to have a conversation with the LLM. This conversation isn't just chat; it includes:

1. **Initial Instructions:** The core rules from the [System Prompt](02_system_prompt.md).
2. **The Task:** The overall goal the Agent needs to achieve.
3. **Observations:** What the Agent currently "sees" in the browser ([BrowserContext](03_browsercontext.md) state, including the [DOM Representation](04_dom_representation.md)).
4. **Action Results:** What happened after the last action was performed ([Action Controller & Registry](05_action_controller___registry.md)).
5. **LLM's Plan:** The sequence of actions the LLM decided on.

The Message Manager solves several key problems:

* **Organizes History:** It structures the conversation chronologically, keeping track of who said what (System, User/Agent State, AI/LLM Plan).
* **Formats Messages:** It ensures the browser state, action results, and even images are formatted correctly so the LLM can understand them.
* **Tracks Size:** It keeps count of the "tokens" (roughly, words or parts of words) used in the conversation history.
* **Manages Limits:** It helps prevent the conversation history from exceeding the LLM's context window limit by trimming the history when it gets too long.

Think of the `MessageManager` as a meticulous secretary for the Agent-LLM conversation. It takes clear, concise notes, presents the current situation accurately, and ensures the conversation doesn't ramble on for too long, keeping everything within the LLM's "attention span".

## Meet the Message Manager: The Conversation Secretary

The `MessageManager` (found in `agent/message_manager/service.py`) is responsible for managing the list of messages that are sent to the LLM in each step.

Here are its main jobs:

1. **Initialization:** When the `Agent` starts, the `MessageManager` is created. It immediately adds the foundational messages:
    * The `SystemMessage` containing the rules from the [System Prompt](02_system_prompt.md).
    * A `HumanMessage` stating the overall `task`.
    * Other initial setup messages (like examples or sensitive data placeholders).
2. **Adding Browser State:** Before asking the LLM what to do next, the `Agent` gets the current `BrowserState`. It then tells the `MessageManager` to add this information as a `HumanMessage`. This message includes the simplified DOM map, the current URL, and potentially a screenshot (if `use_vision` is enabled). It also includes the results (`ActionResult`) from the *previous* step, so the LLM knows what happened last.
3. **Adding LLM Output:** After the LLM responds with its plan (`AgentOutput`), the `Agent` tells the `MessageManager` to add this plan as an `AIMessage`. This typically includes the LLM's reasoning and the list of actions to perform.
4. **Adding Action Results (Indirectly):** The results from the `Controller.act` call (`ActionResult`) aren't added as separate messages *after* the action. Instead, they are included in the *next* `HumanMessage` that contains the browser state (see step 2). This keeps the context tight: "Here's the current page, and here's what happened right before we got here."
5. **Providing Messages to LLM:** When the `Agent` is ready to call the LLM, it asks the `MessageManager` for the current conversation history (`get_messages()`).
6. **Token Management:** Every time a message is added, the `MessageManager` estimates how many tokens it adds (`_count_tokens`) and updates the total. If the total exceeds the limit (`max_input_tokens`), it triggers a truncation strategy (`cut_messages`) to shorten the history: it first removes the image from the most recent state message and, if that is not enough, trims text from that same message.
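
The ordering described in the list above, in particular the way a step's action result only shows up inside the *next* state message, can be sketched with a few lines of plain Python. This is an illustration, not the real class: `MiniMessageManager` and `Msg` are hypothetical names, messages are reduced to a role and a string, and token counting is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Msg:
    role: str      # "system", "human", or "ai"
    content: str


@dataclass
class MiniMessageManager:
    """Hypothetical, text-only stand-in for MessageManager."""
    system_prompt: str
    task: str
    history: List[Msg] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Only seed the history when it is empty (e.g., not resuming).
        if not self.history:
            self.history.append(Msg("system", self.system_prompt))
            self.history.append(Msg("human", f"Your ultimate task is: {self.task}"))

    def add_state_message(self, url: str, last_result: Optional[str] = None) -> None:
        # The previous step's result travels with the new browser state.
        parts = []
        if last_result:
            parts.append(f"Previous action result: {last_result}")
        parts.append(f"Current URL: {url}")
        self.history.append(Msg("human", "\n".join(parts)))

    def add_model_output(self, plan: str) -> None:
        self.history.append(Msg("ai", plan))

    def get_messages(self) -> List[Msg]:
        return list(self.history)


mm = MiniMessageManager("Follow the rules.", "Search for cats")

# Step 1: no previous result yet.
mm.add_state_message("https://example.com")
mm.add_model_output('{"click_element": {"index": 5}}')
last_result = "Clicked element with index 5"   # produced by the Controller

# Step 2: the result of step 1 is reported together with the new state.
mm.add_state_message("https://example.com/results", last_result)

roles = [m.role for m in mm.get_messages()]
```

After two state messages and one plan, `roles` reads system, human (task), human (state 1), ai (plan 1), human (state 2), which matches the "System, Task, State1, Plan1, State2..." pattern the chapter describes.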
## How the Agent Uses the Message Manager

Let's revisit the simplified `Agent.step` method from [Chapter 1](01_agent.md) and highlight the `MessageManager` interactions (using `self._message_manager`):

```python
# --- File: agent/service.py (Simplified step method - Highlighting MessageManager) ---

class Agent:
    # ... (init, run) ...
    async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
        logger.info(f"📍 Step {self.state.n_steps}")
        state = None
        model_output = None
        result: list[ActionResult] = []

        try:
            # 1. Get current state from the browser
            state = await self.browser_context.get_state() # Uses BrowserContext

            # 2. Add state + PREVIOUS result to message history via MessageManager
            #    'self.state.last_result' holds the outcome of the *previous* step's action
            self._message_manager.add_state_message(
                state,
                self.state.last_result, # Result from previous action
                step_info,
                self.settings.use_vision # Tell it whether to include image
            )

            # 3. Get the complete, formatted message history for the LLM
            input_messages = self._message_manager.get_messages()

            # 4. Get LLM's decision on the next action(s)
            model_output = await self.get_next_action(input_messages) # Calls the LLM

            # --- Agent increments step counter ---
            self.state.n_steps += 1

            # 5. Remove the potentially large state message before adding the compact AI response
            #    (An optimization: the bulky state message is not kept in the history)
            self._message_manager._remove_last_state_message()

            # 6. Add the LLM's response (the plan) to the history
            self._message_manager.add_model_output(model_output)

            # 7. Execute the action(s) using the Controller
            result = await self.multi_act(model_output.action) # Uses Controller

            # 8. Store the result of THIS action. It will be used in the *next* step's
            #    call to self._message_manager.add_state_message()
            self.state.last_result = result

            # ... (Record step details, handle success/failure) ...

        except Exception as e:
            # Handle errors...
            result = await self._handle_step_error(e)
            self.state.last_result = result
        # ... (finally block) ...
```

This flow shows the cycle: add state/previous result -> get messages -> call LLM -> add LLM response -> execute action -> store result for *next* state message.

## How it Works Under the Hood: Managing the Flow

Let's visualize the key interactions during one step of the Agent loop involving the `MessageManager`:

```mermaid
sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant MM as MessageManager
    participant LLM
    participant Controller

    Note over Agent: Start of step
    Agent->>BC: get_state()
    BC-->>Agent: Current BrowserState (DOM map, URL, screenshot?)
    Note over Agent: Have BrowserState and `last_result` from previous step
    Agent->>MM: add_state_message(BrowserState, last_result)
    MM->>MM: Format state/result into HumanMessage (with text/image)
    MM->>MM: Calculate tokens for new message
    MM->>MM: Add HumanMessage to internal history list
    MM->>MM: Update total token count
    MM->>MM: Check token limit, potentially call cut_messages()
    Note over Agent: Ready to ask LLM
    Agent->>MM: get_messages()
    MM-->>Agent: Return List[BaseMessage] (System, Task, State1, Plan1, State2...)
    Agent->>LLM: Invoke LLM with message list
    LLM-->>Agent: LLM Response (AgentOutput containing plan)
    Note over Agent: Got LLM's plan
    Agent->>MM: _remove_last_state_message() # Optimization
    MM->>MM: Remove last (large) HumanMessage from list
    Agent->>MM: add_model_output(AgentOutput)
    MM->>MM: Format plan into AIMessage (with tool calls)
    MM->>MM: Calculate tokens for AIMessage
    MM->>MM: Add AIMessage to internal history list
    MM->>MM: Update total token count
    Note over Agent: Ready to execute plan
    Agent->>Controller: multi_act(AgentOutput.action)
    Controller-->>Agent: List[ActionResult] (Result of this step's actions)
    Agent->>Agent: Store ActionResult in `self.state.last_result` (for next step)
    Note over Agent: End of step
```

This shows how `MessageManager` sits between the Agent, the Browser State, and the LLM, managing the history list and token counts.

## Diving Deeper into the Code (`agent/message_manager/service.py`)

Let's look at simplified versions of key methods in `MessageManager`.

**1. Initialization (`__init__` and `_init_messages`)**

When the `Agent` creates the `MessageManager`, it passes the task and the already-formatted `SystemMessage`.

```python
# --- File: agent/message_manager/service.py (Simplified __init__) ---
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage, ToolMessage
# ... other imports ...
from browser_use.agent.views import MessageManagerState # Internal state storage
from browser_use.agent.message_manager.views import MessageMetadata, ManagedMessage # Message wrapper

# Assume MessageManagerSettings (max tokens, image settings, ...) is defined alongside this class

class MessageManager:
    def __init__(
        self,
        task: str,
        system_message: SystemMessage, # Received from Agent
        settings: MessageManagerSettings = MessageManagerSettings(),
        state: MessageManagerState = MessageManagerState(), # Stores history
    ):
        self.task = task
        self.settings = settings # Max tokens, image settings, etc.
        self.state = state # Holds the 'history' object
        self.system_prompt = system_message

        # Only initialize if history is empty (e.g., not resuming from saved state)
        if len(self.state.history.messages) == 0:
            self._init_messages()

    def _init_messages(self) -> None:
        """Add the initial fixed messages to the history."""
        # Add the main system prompt (rules)
        self._add_message_with_tokens(self.system_prompt)

        # Add the user's task
        task_message = HumanMessage(
            content=f'Your ultimate task is: """{self.task}"""...'
        )
        self._add_message_with_tokens(task_message)

        # Add other setup messages (context, sensitive data info, examples)
        # ... (simplified - see full code for details) ...

        # Example: Add a placeholder for where the main history begins
        placeholder_message = HumanMessage(content='[Your task history memory starts here]')
        self._add_message_with_tokens(placeholder_message)
```

This sets up the foundational context for the LLM.

**2. Adding Browser State (`add_state_message`)**

This method takes the current `BrowserState` and the previous `ActionResult`, formats them into a `HumanMessage` (potentially multi-modal with image and text parts), and adds it to the history.

```python
# --- File: agent/message_manager/service.py (Simplified add_state_message) ---
from typing import List, Optional
# ... imports ...
from browser_use.browser.views import BrowserState
from browser_use.agent.views import ActionResult, AgentStepInfo
from browser_use.agent.prompts import AgentMessagePrompt # Helper to format state

class MessageManager:
    # ... (init) ...

    def add_state_message(
        self,
        state: BrowserState, # The current view of the browser
        result: Optional[List[ActionResult]] = None, # Result from *previous* action
        step_info: Optional[AgentStepInfo] = None,
        use_vision=True, # Flag to include screenshot
    ) -> None:
        """Add browser state and previous result as a human message."""

        # Add any 'memory' messages from the previous result first (if any)
        if result:
            for r in result:
                if r.include_in_memory and (r.extracted_content or r.error):
                    content = f"Action result: {r.extracted_content}" if r.extracted_content else f"Action error: {r.error}"
                    msg = HumanMessage(content=content)
                    self._add_message_with_tokens(msg)
                    result = None # Don't include again in the main state message

        # Use a helper class to format the BrowserState (+ optional remaining result)
        # into the correct message structure (text + optional image)
        state_prompt = AgentMessagePrompt(
            state,
            result, # Pass any remaining result info
            include_attributes=self.settings.include_attributes,
            step_info=step_info,
        )
        # Get the formatted message (could be complex list for vision)
        state_message = state_prompt.get_user_message(use_vision)

        # Add the formatted message (with token calculation) to history
        self._add_message_with_tokens(state_message)
```

**3. Adding Model Output (`add_model_output`)**

This takes the LLM's plan (`AgentOutput`) and formats it as an `AIMessage` with the specific "tool calls" structure that many models expect.

```python
# --- File: agent/message_manager/service.py (Simplified add_model_output) ---
# ... imports ...
from browser_use.agent.views import AgentOutput

class MessageManager:
    # ... (init, add_state_message) ...

    def add_model_output(self, model_output: AgentOutput) -> None:
        """Add model output (the plan) as an AI message with tool calls."""
        # Format the output according to OpenAI's tool calling standard
        tool_calls = [
            {
                'name': 'AgentOutput', # The 'tool' name
                'args': model_output.model_dump(mode='json', exclude_unset=True), # The LLM's JSON output
                'id': str(self.state.tool_id), # Unique ID for the call
                'type': 'tool_call',
            }
        ]

        # Create the AIMessage containing the tool calls
        msg = AIMessage(
            content='', # Content is often empty when using tool calls
            tool_calls=tool_calls,
        )

        # Add it to history
        self._add_message_with_tokens(msg)

        # Add a corresponding empty ToolMessage (required by some models)
        self.add_tool_message(content='') # Content depends on tool execution result

    def add_tool_message(self, content: str) -> None:
        """Add tool message to history (often confirms tool call receipt/result)"""
        # ToolMessage links back to the AIMessage's tool_call_id
        msg = ToolMessage(content=content, tool_call_id=str(self.state.tool_id))
        self.state.tool_id += 1 # Increment for next potential tool call
        self._add_message_with_tokens(msg)
```

**4. Adding Messages and Counting Tokens (`_add_message_with_tokens`, `_count_tokens`)**

This is the core function called by others to add any message to the history, ensuring token counts are tracked.

```python
# --- File: agent/message_manager/service.py (Simplified _add_message_with_tokens) ---
# ... imports ...
from langchain_core.messages import BaseMessage
from browser_use.agent.message_manager.views import MessageMetadata, ManagedMessage

class MessageManager:
    # ... (other methods) ...

    def _add_message_with_tokens(self, message: BaseMessage, position: int | None = None) -> None:
        """Internal helper to add any message with its token count metadata."""

        # 1. Optionally filter sensitive data (replace actual data with placeholders)
        # if self.settings.sensitive_data:
        #    message = self._filter_sensitive_data(message) # Simplified

        # 2. Count the tokens in the message
        token_count = self._count_tokens(message)

        # 3. Create metadata object
        metadata = MessageMetadata(tokens=token_count)

        # 4. Add the message and its metadata to the history list
        #    (self.state.history is a MessageHistory object)
        self.state.history.add_message(message, metadata, position)
        #    Note: self.state.history.add_message also updates the total token count

        # 5. Check if history exceeds token limit and truncate if needed
        self.cut_messages() # Check and potentially trim history

    def _count_tokens(self, message: BaseMessage) -> int:
        """Estimate tokens in a message."""
        tokens = 0
        if isinstance(message.content, list): # Multi-modal (text + image)
            for item in message.content:
                if isinstance(item, dict) and 'image_url' in item:
                    # Add fixed cost for images
                    tokens += self.settings.image_tokens
                elif isinstance(item, dict) and 'text' in item:
                    # Estimate tokens based on text length
                    tokens += len(item['text']) // self.settings.estimated_characters_per_token
        elif isinstance(message.content, str): # Text message
            text = message.content
            if hasattr(message, 'tool_calls'): # Add tokens for tool call structure
                text += str(getattr(message, 'tool_calls', ''))
            tokens += len(text) // self.settings.estimated_characters_per_token

        return tokens

    def cut_messages(self):
        """Trim messages if total tokens exceed the limit."""
        # Calculate how many tokens we are over the limit
        diff = self.state.history.current_tokens - self.settings.max_input_tokens
        if diff <= 0:
            return # We are within limits

        logger.debug(f"Token limit exceeded by {diff}. Trimming history.")

        # Strategy:
        # 1. Try removing the image from the *last* (most recent) state message if present.
        #    (Code logic finds the last message, checks content list, removes image item, updates counts)
        # ... (Simplified - see full code for image removal logic) ...

        # 2. If still over limit after image removal (or no image was present),
        #    trim text content from the *end* of the last state message.
        #    Calculate proportion to remove, shorten string, create new message.
        # ... (Simplified - see full code for text trimming logic) ...

        # Ensure we don't get stuck if trimming isn't enough (raise error)
        if self.state.history.current_tokens > self.settings.max_input_tokens:
            raise ValueError("Max token limit reached even after trimming.")
```

This shows the basic mechanics of adding messages, estimating their approximate size, and applying strategies to keep the history within the LLM's context window limit.

## Conclusion

The `MessageManager` is the Agent's conversation secretary. It meticulously records the dialogue between the Agent (reporting browser state and action results) and the LLM (providing analysis and action plans), starting from the initial `System Prompt` and task definition.

Crucially, it formats these messages correctly, tracks the conversation's size using token counts, and implements strategies to keep the history concise enough for the LLM's limited context window. Without the `MessageManager`, the Agent would quickly lose track of the conversation, and the LLM wouldn't have the necessary context to guide the browser effectively.

Many of the objects managed and passed around by the `MessageManager`, like `BrowserState`, `ActionResult`, and `AgentOutput`, are defined as specific data structures. In the next chapter, we'll take a closer look at these important **Data Structures (Views)**.
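
As a closing illustration of the token bookkeeping covered in this chapter, the sketch below works through the estimate by hand: text costs roughly one token per few characters, and each image costs a fixed amount. The constants `CHARS_PER_TOKEN = 3` and `IMAGE_TOKENS = 800` are assumed values chosen for the arithmetic, not settings confirmed by the source, and `estimate_tokens` and `drop_images` are hypothetical helper names.

```python
from typing import Union

CHARS_PER_TOKEN = 3   # assumed value, for illustration only
IMAGE_TOKENS = 800    # assumed fixed cost per image, for illustration only


def estimate_tokens(content: Union[str, list]) -> int:
    """Estimate tokens for a plain string or a multi-modal content list."""
    if isinstance(content, str):
        return len(content) // CHARS_PER_TOKEN
    total = 0
    for item in content:
        if "image_url" in item:
            total += IMAGE_TOKENS                         # flat cost per image
        elif "text" in item:
            total += len(item["text"]) // CHARS_PER_TOKEN
    return total


def drop_images(content: list) -> list:
    """First trimming step: keep the text parts, discard image parts."""
    return [item for item in content if "image_url" not in item]


# A state message with 300 characters of text plus one screenshot.
state_message = [
    {"type": "text", "text": "x" * 300},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]

with_image = estimate_tokens(state_message)                  # 300 // 3 + 800
without_image = estimate_tokens(drop_images(state_message))  # 300 // 3
```

Under these assumed numbers the screenshot accounts for most of the message's estimated size, which is one way to see why dropping the image is tried before any text is trimmed.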
[Next Chapter: Data Structures (Views)](07_data_structures__views_.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Browser Use/07_data_structures__views_.md
================================================
---
layout: default
title: "Data Structures (Views)"
parent: "Browser Use"
nav_order: 7
---

# Chapter 7: Data Structures (Views) - The Project's Blueprints

In the [previous chapter](06_message_manager.md), we saw how the `MessageManager` acts like a secretary, carefully organizing the conversation between the [Agent](01_agent.md) and the LLM. It manages different pieces of information – the browser's current state, the LLM's plan, the results of actions, and more.

But how do all these different components – the Agent, the LLM parser, the [BrowserContext](03_browsercontext.md), the [Action Controller & Registry](05_action_controller___registry.md), and the [Message Manager](06_message_manager.md) – ensure they understand each other perfectly? If the LLM gives a plan in one format, and the Controller expects it in another, things will break!

Imagine trying to build furniture using instructions written in a language you don't fully understand, or trying to fill out a form where every section uses a different layout. It would be confusing and error-prone. We need a shared, consistent language and format.

This is where **Data Structures (Views)** come in. They act as the official blueprints or standardized forms for all the important information passed around within the `Browser Use` project.

## What Problem Do Data Structures Solve?

In a complex system like `Browser Use`, many components need to exchange data:

* The [BrowserContext](03_browsercontext.md) needs to package up the current state of the webpage.
* The [Agent](01_agent.md) needs to understand the LLM's multi-step plan.
* The [Action Controller & Registry](05_action_controller___registry.md) needs to know exactly which action to perform and with what specific parameters (like which element index to click).
* The Controller needs to report back the result of an action in a predictable way.

Without a standard format for each piece of data, you might encounter problems like:

* Misinterpreting data (e.g., is `5` an element index or a quantity?).
* Missing required information.
* Inconsistent naming (`element_id` vs `index` vs `element_number`).
* Difficulty debugging when data looks different every time.

Data Structures (Views) solve this by defining **strict, consistent blueprints** for the data. Everyone agrees to use these blueprints, ensuring smooth communication and preventing errors.

## Meet Pydantic: The Blueprint Maker and Checker

In `Browser Use`, these blueprints are primarily defined using a popular Python library called **Pydantic**.

Think of Pydantic like a combination of:

1. **A Blueprint Designer:** It provides an easy way to define the structure of your data using standard Python type hints (like `str` for text, `int` for whole numbers, `bool` for True/False, `list` for lists).
2. **A Quality Inspector:** When data comes in (e.g., from the LLM or from an action's result), Pydantic automatically checks if it matches the blueprint. Does it have all the required fields? Are the data types correct? If not, Pydantic raises an error, stopping bad data before it causes problems later.

These Pydantic models (our blueprints) are often stored in files named `views.py` within different component directories (like `agent/views.py`, `browser/views.py`), which is why we sometimes call them "Views".

## Key Blueprints in `Browser Use`

Let's look at some of the most important data structures used in the project. Don't worry about memorizing every detail; focus on *what kind* of information each blueprint holds and *who* uses it.

*(Note: These are simplified representations.
The actual models might have more fields or features.)*

### 1. `BrowserState` (from `browser/views.py`)

* **Purpose:** Represents a complete snapshot of the browser's state at a specific moment.
* **Blueprint Contents (Simplified):**
    * `url`: The current web address (string).
    * `title`: The title of the webpage (string).
    * `element_tree`: The simplified map of the webpage content (from [DOM Representation](04_dom_representation.md)).
    * `selector_map`: The lookup map for interactive elements (from [DOM Representation](04_dom_representation.md)).
    * `screenshot`: An optional image of the page (string, base64 encoded).
    * `tabs`: Information about other open tabs in this context (list).
* **Who Uses It:**
    * Created by: [BrowserContext](03_browsercontext.md) (`get_state()` method).
    * Used by: [Agent](01_agent.md) (to see the current situation), [Message Manager](06_message_manager.md) (to store in history).

```python
# --- Conceptual Pydantic Model ---
# File: browser/views.py (Simplified Example)
from pydantic import BaseModel
from typing import Optional, List, Dict  # For type hints
# Assume DOMElementNode and TabInfo are defined elsewhere

class BrowserState(BaseModel):
    url: str
    title: str
    element_tree: Optional[object]  # Simplified: Actual type is DOMElementNode
    selector_map: Optional[Dict[int, object]]  # Simplified: Actual type is SelectorMap
    screenshot: Optional[str] = None  # Optional field
    tabs: List[object] = []  # Simplified: Actual type is TabInfo

# Pydantic ensures that when a BrowserState is created,
# 'url' and 'title' MUST be provided as strings.
```

### 2. `ActionModel` (from `controller/registry/views.py`)

* **Purpose:** Represents a *single* specific action the LLM wants to perform, including its parameters. This model is often created *dynamically* based on the actions available in the [Action Controller & Registry](05_action_controller___registry.md).
* **Blueprint Contents (Example for `click_element`):**
    * `index`: The `highlight_index` of the element to click (integer).
    * `xpath`: An optional hint about the element's location (string).
* **Blueprint Contents (Example for `input_text`):**
    * `index`: The `highlight_index` of the input field (integer).
    * `text`: The text to type (string).
* **Who Uses It:**
    * Defined by/Registered in: [Action Controller & Registry](05_action_controller___registry.md).
    * Created based on: LLM output (often part of `AgentOutput`).
    * Used by: [Action Controller & Registry](05_action_controller___registry.md) (to validate parameters and know what function to call).

```python
# --- Conceptual Pydantic Models ---
# File: controller/views.py (Simplified Examples)
from pydantic import BaseModel
from typing import Optional

class ClickElementAction(BaseModel):
    index: int
    xpath: Optional[str] = None  # Optional hint

class InputTextAction(BaseModel):
    index: int
    text: str
    xpath: Optional[str] = None  # Optional hint

# Base model that dynamically holds ONE of the above actions
class ActionModel(BaseModel):
    # Pydantic allows models like this where only one field is expected
    # e.g., ActionModel(click_element=ClickElementAction(index=5))
    # or    ActionModel(input_text=InputTextAction(index=12, text="hello"))
    click_element: Optional[ClickElementAction] = None
    input_text: Optional[InputTextAction] = None
    # ... fields for other possible actions (scroll, done, etc.) ...
    # More complex logic handles ensuring only one action is present
```

### 3. `AgentOutput` (from `agent/views.py`)

* **Purpose:** Represents the complete plan received from the LLM after it analyzes the current state. This is the structure the [System Prompt](02_system_prompt.md) tells the LLM to follow.
* **Blueprint Contents (Simplified):**
    * `current_state`: The LLM's thoughts/reasoning (a nested structure, often called `AgentBrain`).
    * `action`: A *list* of one or more `ActionModel` objects representing the steps the LLM wants to take.
* **Who Uses It:**
    * Created by: The [Agent](01_agent.md) parses the LLM's raw JSON output into this structure.
    * Used by: [Agent](01_agent.md) (to understand the plan), [Message Manager](06_message_manager.md) (to store the plan in history), [Action Controller & Registry](05_action_controller___registry.md) (reads the `action` list).

```python
# --- Conceptual Pydantic Model ---
# File: agent/views.py (Simplified Example)
from pydantic import BaseModel
from typing import List
# Assume ActionModel and AgentBrain are defined elsewhere

class AgentOutput(BaseModel):
    current_state: object  # Simplified: Actual type is AgentBrain
    action: List[ActionModel]  # A list of actions to execute

# Pydantic ensures the LLM output MUST have 'current_state' and 'action',
# and that 'action' MUST be a list containing valid ActionModel objects.
```

### 4. `ActionResult` (from `agent/views.py`)

* **Purpose:** Represents the outcome after the [Action Controller & Registry](05_action_controller___registry.md) attempts to execute a single action.
* **Blueprint Contents (Simplified):**
    * `is_done`: Did this action signal the end of the overall task? (boolean, optional).
    * `success`: If done, was the task successful overall? (boolean, optional).
    * `extracted_content`: Any text result from the action (e.g., "Clicked button X") (string, optional).
    * `error`: Any error message if the action failed (string, optional).
    * `include_in_memory`: Should this result be explicitly shown to the LLM next time? (boolean).
* **Who Uses It:**
    * Created by: Functions within the [Action Controller & Registry](05_action_controller___registry.md) (like `click_element`).
    * Used by: [Agent](01_agent.md) (to check status, record results), [Message Manager](06_message_manager.md) (includes info in the next state message sent to LLM).
```python # --- Conceptual Pydantic Model --- # File: agent/views.py (Simplified Example) from pydantic import BaseModel from typing import Optional class ActionResult(BaseModel): is_done: Optional[bool] = False success: Optional[bool] = None extracted_content: Optional[str] = None error: Optional[str] = None include_in_memory: bool = False # Default to False # Pydantic helps ensure results are consistently structured. # For example, 'is_done' must be True or False if provided. ``` ## The Power of Blueprints: Ensuring Consistency Using Pydantic models for these data structures provides a huge benefit: **automatic validation**. Imagine the LLM sends back a plan, but it forgets to include the `index` for a `click_element` action. ```json // Bad LLM Response (Missing 'index') { "current_state": { ... }, "action": [ { "click_element": { "xpath": "//button[@id='submit']" // 'index' is missing! } } ] } ``` When the [Agent](01_agent.md) tries to parse this JSON into the `AgentOutput` Pydantic model, Pydantic will immediately notice that the `index` field (which is required by the `ClickElementAction` blueprint) is missing. It will raise a `ValidationError`. ```python # --- Conceptual Agent Code --- import pydantic # Assume AgentOutput is the Pydantic model defined earlier # Assume 'llm_json_response' contains the bad JSON from above try: # Try to create the AgentOutput object from the LLM's response llm_plan = AgentOutput.model_validate_json(llm_json_response) # If validation succeeds, proceed... print("LLM Plan Validated:", llm_plan) except pydantic.ValidationError as e: # Pydantic catches the error! print(f"Validation Error: The LLM response didn't match the blueprint!") print(e) # The Agent can now handle this error gracefully, # maybe asking the LLM to try again, instead of crashing later. 
``` This automatic checking catches errors early, preventing the [Action Controller & Registry](05_action_controller___registry.md) from receiving incomplete instructions and making the whole system much more robust and easier to debug. It enforces the "contract" between different components. ## Under the Hood: Simple Classes These data structures are simply Python classes, mostly inheriting from `pydantic.BaseModel` or defined using Python's built-in `dataclass`. They don't contain complex logic themselves; their main job is to define the *shape* and *type* of the data. You'll find their definitions scattered across the various `views.py` files within the project's component directories (like `agent/`, `browser/`, `controller/`, `dom/`). Think of them as the official vocabulary and grammar rules that all the components agree to use when communicating. ## Conclusion Data Structures (Views), primarily defined using Pydantic models, are the essential blueprints that ensure consistent and reliable communication within the `Browser Use` project. They act like standardized forms for `BrowserState`, `AgentOutput`, `ActionModel`, and `ActionResult`, making sure every component knows exactly what kind of data to expect and how to interpret it. By defining these clear structures and leveraging Pydantic's automatic validation, `Browser Use` prevents misunderstandings between components, catches errors early, and makes the overall system more robust and maintainable. These standardized structures also make it easier to log and understand what's happening in the system. Speaking of logging and understanding the system's behavior, how can we monitor the Agent's performance and gather data for improvement? In the next and final chapter, we'll explore the [Telemetry Service](08_telemetry_service.md). 
[Next Chapter: Telemetry Service](08_telemetry_service.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


================================================
FILE: docs/Browser Use/08_telemetry_service.md
================================================
---
layout: default
title: "Telemetry Service"
parent: "Browser Use"
nav_order: 8
---

# Chapter 8: Telemetry Service - Helping Improve the Project (Optional)

In the [previous chapter](07_data_structures__views_.md), we explored the essential blueprints (`Data Structures (Views)`) that keep communication clear and consistent between all the parts of `Browser Use`. We saw how components like the [Agent](01_agent.md) and the [Action Controller & Registry](05_action_controller___registry.md) use these blueprints to exchange information reliably.

Now, let's think about the project itself. How do the developers who build `Browser Use` know if it's working well for users? How do they find out about common errors or which features are most popular, so they can make the tool better?

## What Problem Does the Telemetry Service Solve?

Imagine you released a new tool, like `Browser Use`. You want it to be helpful, but you don't know how people are actually using it. Are they running into unexpected errors? Are certain actions (like clicking vs. scrolling) causing problems? Is the performance okay? Without some feedback, it's hard to know where to focus improvements.

One way to get feedback is through bug reports or feature requests, but that only captures a small fraction of user experiences. We need a way to get a broader, anonymous picture of how the tool is performing "in the wild."

The **Telemetry Service** solves this by providing an *optional* and *anonymous* way to send basic usage statistics back to the project developers. Think of it like an anonymous suggestion box or an automatic crash report that doesn't include any personal information.
**Crucially:** This service is designed to protect user privacy. It doesn't collect website content, personal data, or anything sensitive. It only sends anonymous statistics about the tool's operation, and **it can be completely disabled**. ## Meet `ProductTelemetry`: The Anonymous Reporter The component responsible for this is the `ProductTelemetry` service, found in `telemetry/service.py`. * **Collects Usage Data:** It gathers anonymized information about events like: * When an [Agent](01_agent.md) starts or finishes a run. * Details about each step the Agent takes (like which actions were used). * Errors encountered during agent runs. * Which actions are defined in the [Action Controller & Registry](05_action_controller___registry.md). * **Anonymizes Data:** It uses a randomly generated user ID (stored locally, not linked to you) to group events from the same installation without knowing *who* the user is. * **Sends Data:** It sends this anonymous data to a secure third-party service (PostHog) used by the developers to analyze trends and identify potential issues. * **Optional:** You can easily turn it off. ## How is Telemetry Used? (Mostly Automatic) You usually don't interact with the `ProductTelemetry` service directly. Instead, other components like the `Agent` and `Controller` automatically call it at key moments. **Example: Agent Run Start/End** When you create an `Agent` and call `agent.run()`, the Agent automatically notifies the Telemetry Service. ```python # --- File: agent/service.py (Simplified Agent run method) --- class Agent: # ... (other methods) ... # Agent has a telemetry object initialized in __init__ # self.telemetry = ProductTelemetry() async def run(self, max_steps: int = 100) -> AgentHistoryList: # ---> Tell Telemetry: Agent run is starting <--- self._log_agent_run() # This includes a telemetry.capture() call try: # ... (main agent loop runs here) ... for step_num in range(max_steps): # ... (agent takes steps) ... 
if self.state.history.is_done(): break # ... finally: # ---> Tell Telemetry: Agent run is ending <--- self.telemetry.capture( AgentEndTelemetryEvent( # Uses a specific data structure agent_id=self.state.agent_id, is_done=self.state.history.is_done(), success=self.state.history.is_successful(), # ... other anonymous stats ... ) ) # ... (cleanup browser etc.) ... return self.state.history ``` **Explanation:** 1. When the `Agent` is created, it gets an instance of `ProductTelemetry`. 2. Inside the `run` method, before the main loop starts, `_log_agent_run()` is called, which internally uses `self.telemetry.capture()` to send an `AgentRunTelemetryEvent`. 3. After the loop finishes (or an error occurs), the `finally` block ensures that another `self.telemetry.capture()` call is made, this time sending an `AgentEndTelemetryEvent` with summary statistics about the run. Similarly, the `Agent.step` method captures an `AgentStepTelemetryEvent`, and the `Controller`'s `Registry` captures a `ControllerRegisteredFunctionsTelemetryEvent` when it's initialized. This happens automatically in the background if telemetry is enabled. ## How to Disable Telemetry If you prefer not to send any anonymous usage data, you can easily disable the Telemetry Service. Set the environment variable `ANONYMIZED_TELEMETRY` to `False`. How you set environment variables depends on your operating system: * **Linux/macOS (in terminal):** ```bash export ANONYMIZED_TELEMETRY=False # Now run your Python script in the same terminal python your_agent_script.py ``` * **Windows (Command Prompt):** ```cmd set ANONYMIZED_TELEMETRY=False python your_agent_script.py ``` * **Windows (PowerShell):** ```powershell $env:ANONYMIZED_TELEMETRY="False" python your_agent_script.py ``` * **In Python Code (using `os` module, *before* importing `browser_use`):** ```python import os os.environ['ANONYMIZED_TELEMETRY'] = 'False' # Now import and use browser_use from browser_use import Agent # ... other imports # ... 
rest of your script ... ``` If this environment variable is set to `False`, the `ProductTelemetry` service will be initialized in a disabled state, and no data will be collected or sent. ## How It Works Under the Hood: Sending Anonymous Data When telemetry is enabled and an event occurs (like `agent.run()` starting): 1. **Component Calls Capture:** The `Agent` (or `Controller`) calls `telemetry.capture(event_data)`. 2. **Telemetry Service Checks:** The `ProductTelemetry` service checks if it's enabled. If not, it does nothing. 3. **Get User ID:** It retrieves or generates a unique, anonymous user ID. This is typically a random UUID (like `a1b2c3d4-e5f6-7890-abcd-ef1234567890`) stored in a file under a hidden directory on your computer (`~/.cache/browser_use/telemetry_user_id`). This ID helps group events from the same installation without identifying the actual user. 4. **Send to PostHog:** It sends the event data (structured using dataclass-based event models like `AgentRunTelemetryEvent`) along with the anonymous user ID to PostHog, a third-party service specialized in product analytics. 5. **Analysis:** Developers can then look at aggregated, anonymous trends in PostHog (e.g., "What percentage of agent runs finish successfully?", "What are the most common errors?") to understand usage patterns and prioritize improvements. Here's a simplified diagram: ```mermaid sequenceDiagram participant Agent participant TelemetrySvc as ProductTelemetry participant LocalFile as ~/.cache/.../user_id participant PostHog Agent->>TelemetrySvc: capture(AgentRunEvent) Note over TelemetrySvc: Telemetry Enabled? Yes. TelemetrySvc->>LocalFile: Read existing User ID (or create new) LocalFile-->>TelemetrySvc: Anonymous User ID (UUID) Note over TelemetrySvc: Package Event + User ID TelemetrySvc->>PostHog: Send(EventData, UserID) PostHog-->>TelemetrySvc: Acknowledgment (Optional) ``` Let's look at the simplified code involved. **1.
Initializing Telemetry (`telemetry/service.py`)** The service checks the environment variable during initialization. ```python # --- File: telemetry/service.py (Simplified __init__) --- import os import uuid import logging from pathlib import Path from posthog import Posthog # The library for the external service from browser_use.utils import singleton logger = logging.getLogger(__name__) @singleton # Ensures only one instance exists class ProductTelemetry: USER_ID_PATH = str(Path.home() / '.cache' / 'browser_use' / 'telemetry_user_id') # ... (API key constants) ... _curr_user_id = None def __init__(self) -> None: # Check the environment variable telemetry_disabled = os.getenv('ANONYMIZED_TELEMETRY', 'true').lower() == 'false' if telemetry_disabled: self._posthog_client = None # Telemetry is off logger.debug('Telemetry disabled by environment variable.') else: # Initialize the PostHog client if enabled self._posthog_client = Posthog(...) logger.info( 'Anonymized telemetry enabled.' # Inform the user ) # Optionally silence PostHog's own logs # ... # ... (other methods) ... ``` **2. Capturing an Event (`telemetry/service.py`)** The `capture` method sends the data if the client is active. ```python # --- File: telemetry/service.py (Simplified capture) --- # Assume BaseTelemetryEvent is the base dataclass for events from browser_use.telemetry.views import BaseTelemetryEvent class ProductTelemetry: # ... (init) ...
def capture(self, event: BaseTelemetryEvent) -> None: # Do nothing if telemetry is disabled if self._posthog_client is None: return try: # Get the anonymous user ID (lazy loaded) anon_user_id = self.user_id # Send the event name and its properties (as a dictionary) self._posthog_client.capture( distinct_id=anon_user_id, event=event.name, # e.g., "agent_run" properties=event.properties # Data from the event model ) logger.debug(f'Telemetry event captured: {event.name}') except Exception as e: # Don't crash the main application if telemetry fails logger.error(f'Failed to send telemetry event {event.name}: {e}') @property def user_id(self) -> str: """Gets or creates the anonymous user ID.""" if self._curr_user_id: return self._curr_user_id try: # Check if the ID file exists id_file = Path(self.USER_ID_PATH) if not id_file.exists(): # Create directory and generate a new UUID if it doesn't exist id_file.parent.mkdir(parents=True, exist_ok=True) new_user_id = str(uuid.uuid4()) id_file.write_text(new_user_id) self._curr_user_id = new_user_id else: # Read the existing UUID from the file self._curr_user_id = id_file.read_text().strip() except Exception: # Fallback if file access fails self._curr_user_id = 'UNKNOWN_USER_ID' return self._curr_user_id ``` **3. Event Data Structures (`telemetry/views.py`)** Like other components, Telemetry uses structured models to define the shape of the data being sent. Here they are plain Python dataclasses rather than Pydantic models.
```python
# --- File: telemetry/views.py (Simplified Event Example) ---
from dataclasses import dataclass, asdict
from typing import Any, Dict, Sequence

# Base class for all telemetry events (conceptual)
@dataclass
class BaseTelemetryEvent:
    @property
    def name(self) -> str:
        raise NotImplementedError

    @property
    def properties(self) -> Dict[str, Any]:
        # Helper to convert the dataclass fields to a dictionary
        return {k: v for k, v in asdict(self).items() if k != 'name'}

# Specific event for when an agent run starts
@dataclass
class AgentRunTelemetryEvent(BaseTelemetryEvent):
    agent_id: str  # Anonymous ID for the specific agent instance
    use_vision: bool  # Was vision enabled?
    task: str  # The task description (anonymized/hashed in practice)
    model_name: str  # Name of the LLM used
    chat_model_library: str  # Library used for the LLM (e.g., ChatOpenAI)
    version: str  # browser-use version
    source: str  # How browser-use was installed (e.g., pip, git)
    name: str = 'agent_run'  # The event name sent to PostHog

# ... other event models like AgentEndTelemetryEvent, AgentStepTelemetryEvent ...
```

These dataclass-based structures ensure the data sent to PostHog is consistent and well-defined.

## Conclusion

The **Telemetry Service** (`ProductTelemetry`) provides an optional and privacy-conscious way for the `Browser Use` project to gather anonymous feedback about how the tool is being used. It automatically captures events like agent runs, steps, and errors, sending anonymized statistics to developers via PostHog.

This feedback loop is vital for identifying common issues, understanding feature usage, and ultimately improving the `Browser Use` library for everyone. Remember, you have full control and can easily disable this service by setting the `ANONYMIZED_TELEMETRY=False` environment variable.

This chapter concludes our tour of the core components within the `Browser Use` project.
You've learned about the [Agent](01_agent.md), the guiding [System Prompt](02_system_prompt.md), the isolated [BrowserContext](03_browsercontext.md), the webpage map ([DOM Representation](04_dom_representation.md)), the action execution engine ([Action Controller & Registry](05_action_controller___registry.md)), the conversation tracker ([Message Manager](06_message_manager.md)), the data blueprints ([Data Structures (Views)](07_data_structures__views_.md)), and now the optional feedback mechanism ([Telemetry Service](08_telemetry_service.md)). We hope this gives you a solid foundation for understanding and using `Browser Use`!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


================================================
FILE: docs/Browser Use/index.md
================================================
---
layout: default
title: "Browser Use"
nav_order: 4
has_children: true
---

# Tutorial: Browser Use

> This tutorial is AI-generated! To learn more, check out [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

**Browser Use** ([View Repo](https://github.com/browser-use/browser-use/tree/3076ba0e83f30b45971af58fe2aeff64472da812/browser_use)) is a project that allows an *AI agent* to control a web browser and perform tasks automatically. Think of it like an AI assistant that can browse websites, fill forms, click buttons, and extract information based on your instructions. It uses a Large Language Model (LLM) as its "brain" to decide what actions to take on a webpage to complete a given *task*. The project manages the browser session, understands the page structure (DOM), and communicates back and forth with the LLM.
```mermaid
flowchart TD
    A0["Agent"]
    A1["BrowserContext"]
    A2["Action Controller & Registry"]
    A3["DOM Representation"]
    A4["Message Manager"]
    A5["System Prompt"]
    A6["Data Structures (Views)"]
    A7["Telemetry Service"]
    A0 -- "Gets state from" --> A1
    A0 -- "Uses to execute actions" --> A2
    A0 -- "Uses for LLM communication" --> A4
    A0 -- "Gets instructions from" --> A5
    A0 -- "Uses/Produces data formats" --> A6
    A0 -- "Logs events to" --> A7
    A1 -- "Gets DOM structure via" --> A3
    A1 -- "Provides BrowserState" --> A6
    A2 -- "Executes actions on" --> A1
    A2 -- "Defines/Uses ActionModel/Ac..." --> A6
    A2 -- "Logs registered functions to" --> A7
    A3 -- "Provides structure to" --> A1
    A3 -- "Uses DOM structures" --> A6
    A4 -- "Provides messages to" --> A0
    A4 -- "Initializes with" --> A5
    A4 -- "Formats data using" --> A6
    A5 -- "Defines structure for Agent..." --> A6
    A7 -- "Receives events from" --> A0
```


================================================
FILE: docs/Celery/01_celery_app.md
================================================
---
layout: default
title: "Celery App"
parent: "Celery"
nav_order: 1
---

# Chapter 1: The Celery App - Your Task Headquarters

Welcome to the world of Celery! If you've ever thought, "I wish this slow part of my web request could run somewhere else later," or "How can I process this huge amount of data without freezing my main application?", then Celery is here to help.

Celery allows you to run code (we call these "tasks") separately from your main application, either in the background on the same machine or distributed across many different machines.

But how do you tell Celery *what* tasks to run and *how* to run them? That's where the **Celery App** comes in.

## What Problem Does the Celery App Solve?

Imagine you're building a website. When a user uploads a profile picture, you need to resize it into different formats (thumbnail, medium, large). Doing this immediately when the user clicks "upload" can make the request slow and keep the user waiting.
Ideally, you want to:

1. Quickly save the original image.
2. Tell the user "Okay, got it!"
3. *Later*, in the background, resize the image.

Celery helps with step 3. But you need a central place to define the "resize image" task and configure *how* it should be run (e.g., where to send the request to resize, where to store the result).

The **Celery App** is that central place. Think of it like the main application object in web frameworks like Flask or Django. It's the starting point, the brain, the headquarters for everything Celery-related in your project.

## Creating Your First Celery App

Getting started is simple. You just need to create an instance of the `Celery` class.

Let's create a file named `celery_app.py`:

```python
# celery_app.py
from celery import Celery

# Create a Celery app instance
# 'tasks' is just a name for this app instance, often the module name.
# 'broker' tells Celery where to send task messages.
# We'll use Redis here for simplicity (you need Redis running).
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')  # Added backend for results

print(f"Celery app created: {app}")
```

**Explanation:**

* `from celery import Celery`: We import the main `Celery` class.
* `app = Celery(...)`: We create an instance.
    * `'tasks'`: This is the *name* of our Celery application. It's often good practice to use the name of the module where your app is defined. Celery uses this name as the prefix for automatically generated task names when tasks are defined in a script that is run directly (the `__main__` module); otherwise, the task's own module name is used as the prefix.
    * `broker='redis://localhost:6379/0'`: This is crucial! It tells Celery where to send the task messages. A "broker" is like a post office for tasks. We're using Redis here, but Celery supports others like RabbitMQ. We'll learn more about the [Broker Connection (AMQP)](04_broker_connection__amqp_.md) in Chapter 4. (Note: AMQP is the protocol often used with brokers like RabbitMQ, but the concept applies even when using Redis).
* `backend='redis://localhost:6379/0'`: This tells Celery where to store the results of your tasks. If your task returns a value (like `2+2` returns `4`), Celery can store this `4` in the backend. We'll cover the [Result Backend](06_result_backend.md) in Chapter 6. That's it! You now have a `Celery` application instance named `app`. This `app` object is your main tool for working with Celery. ## Defining a Task with the App Now that we have our `app`, how do we define a task? We use the `@app.task` decorator. Let's modify `celery_app.py`: ```python # celery_app.py from celery import Celery import time # Create a Celery app instance app = Celery('tasks', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0') # Define a simple task using the app's decorator @app.task def add(x, y): print(f"Task 'add' started with args: ({x}, {y})") time.sleep(2) # Simulate some work result = x + y print(f"Task 'add' finished with result: {result}") return result print(f"Task 'add' is registered: {app.tasks.get('celery_app.add')}") ``` **Explanation:** * `@app.task`: This is the magic decorator. It takes our regular Python function `add(x, y)` and registers it as a Celery task within our `app`. * Now, `app` knows about a task called `celery_app.add` (Celery automatically generates the name based on the module `celery_app` and function `add`). * We'll learn all about [Task](03_task.md)s in Chapter 3. ## Sending a Task (Conceptual) How do we actually *run* this `add` task in the background? We use methods like `.delay()` or `.apply_async()` on the task object itself. ```python # In a separate Python script or interpreter, after importing 'add' from celery_app.py from celery_app import add # Send the task to the broker configured in our 'app' result_promise = add.delay(4, 5) print(f"Task sent! 
It will run in the background.") print(f"We got back a promise object: {result_promise}") # We can later check the result using result_promise.get() # (Requires a result backend and a worker running the task) ``` **Explanation:** * `add.delay(4, 5)`: This doesn't run the `add` function *right now*. Instead, it: 1. Packages the task name (`celery_app.add`) and its arguments (`4`, `5`) into a message. 2. Sends this message to the **broker** (Redis, in our case) that was configured in our `Celery` app instance (`app`). * It returns an `AsyncResult` object (our `result_promise`), which is like an IOU or a placeholder for the actual result. We can use this later to check if the task finished and what its result was (if we configured a [Result Backend](06_result_backend.md)). A separate program, called a Celery [Worker](05_worker.md), needs to be running. This worker watches the broker for new task messages, executes the corresponding task function, and (optionally) stores the result in the backend. We'll learn how to run a worker in Chapter 5. The key takeaway here is that the **Celery App** holds the configuration needed (`broker` and `backend` URLs) for `add.delay()` to know *where* to send the task message and potentially where the result will be stored. ## How It Works Internally (High-Level) Let's visualize the process of creating the app and sending a task: 1. **Initialization (`Celery(...)`)**: When you create `app = Celery(...)`, the app instance stores the `broker` and `backend` URLs and sets up internal components like the task registry. 2. **Task Definition (`@app.task`)**: The decorator tells the `app` instance: "Hey, remember this function `add`? It's a task." The app stores this information in its internal task registry (`app.tasks`). 3. **Sending a Task (`add.delay(4, 5)`)**: * `add.delay()` looks up the `app` it belongs to. * It asks the `app` for the `broker` URL. 
* It creates a message containing the task name (`celery_app.add`), arguments (`4, 5`), and other details. * It uses the `broker` URL to connect to the broker (Redis) and sends the message. ```mermaid sequenceDiagram participant Client as Your Python Code participant CeleryApp as app = Celery(...) participant AddTask as @app.task add() participant Broker as Redis/RabbitMQ Client->>CeleryApp: Create instance (broker='redis://...') Client->>AddTask: Define add() function with @app.task Note over AddTask,CeleryApp: Decorator registers 'add' with 'app' Client->>AddTask: Call add.delay(4, 5) AddTask->>CeleryApp: Get broker configuration CeleryApp-->>AddTask: 'redis://...' AddTask->>Broker: Send task message ('add', 4, 5) Broker-->>AddTask: Acknowledgment (message sent) AddTask-->>Client: Return AsyncResult (promise) ``` This diagram shows how the `Celery App` acts as the central coordinator, holding configuration and enabling the task (`add`) to send its execution request to the Broker. ## Code Dive: Inside the `Celery` Class Let's peek at some relevant code snippets (simplified for clarity). **Initialization (`app/base.py`)** When you call `Celery(...)`, the `__init__` method runs: ```python # Simplified from celery/app/base.py from .registry import TaskRegistry from .utils import Settings class Celery: def __init__(self, main=None, broker=None, backend=None, include=None, config_source=None, task_cls=None, autofinalize=True, **kwargs): self.main = main # Store the app name ('tasks' in our example) self._tasks = TaskRegistry({}) # Create an empty dictionary for tasks # Store broker/backend/include settings temporarily self._preconf = {} self.__autoset('broker_url', broker) self.__autoset('result_backend', backend) self.__autoset('include', include) # ... other kwargs ... # Configuration object - initially pending, loaded later self._conf = Settings(...) # ... other setup ... 
_register_app(self) # Register this app instance globally (sometimes useful) # Helper to store initial settings before full configuration load def __autoset(self, key, value): if value is not None: self._preconf[key] = value ``` This shows how the `Celery` object is initialized, storing the name, setting up a task registry, and holding onto initial configuration like the `broker` URL. The full configuration is often loaded later (see [Configuration](02_configuration.md)). **Task Decorator (`app/base.py`)** The `@app.task` decorator ultimately calls `_task_from_fun`: ```python # Simplified from celery/app/base.py def task(self, *args, **opts): # ... logic to handle decorator arguments ... def _create_task_cls(fun): # If app isn't finalized, might return a proxy object first # Eventually calls _task_from_fun to create/register the task ret = self._task_from_fun(fun, **opts) return ret return _create_task_cls def _task_from_fun(self, fun, name=None, base=None, bind=False, **options): # Generate task name if not provided (e.g., 'celery_app.add') name = name or self.gen_task_name(fun.__name__, fun.__module__) base = base or self.Task # Default base Task class # Check if task already registered if name not in self._tasks: # Create a Task class dynamically based on the function task = type(fun.__name__, (base,), { 'app': self, # Link task back to this app instance! 'name': name, 'run': staticmethod(fun), # The actual function to run # ... other attributes and options ... })() # Instantiate the new task class self._tasks[task.name] = task # Add to app's task registry task.bind(self) # Perform any binding steps else: task = self._tasks[name] # Task already exists return task ``` This shows how the decorator uses the `app` instance (`self`) to generate a name, create a `Task` object wrapping your function, associate the task with the app (`'app': self`), and store it in the `app._tasks` registry. 
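
The registration pattern above can be sketched without Celery at all. The following is a toy, standard-library-only model; `MiniApp` and `MiniTask` are hypothetical names invented for this illustration and are not Celery classes. It only shows the idea of a decorator that wraps a function in a dynamically created class, links it back to the app, and stores it in a registry.

```python
# Toy model of decorator-based task registration (NOT Celery's real code).
# MiniTask and MiniApp are hypothetical names used only for illustration.

class MiniTask:
    """Hypothetical stand-in for a task base class."""
    app = None
    name = None

    def run(self, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        # Calling the task object directly runs the wrapped function.
        return self.run(*args, **kwargs)


class MiniApp:
    """Hypothetical stand-in for the app; it only keeps a task registry."""

    def __init__(self, main):
        self.main = main
        self.tasks = {}  # maps task name -> task instance

    def task(self, fun):
        # Build a name from the module and function name.
        name = f"{fun.__module__}.{fun.__name__}"
        if name not in self.tasks:
            # Dynamically create a class wrapping the function, then instantiate it.
            task_cls = type(fun.__name__, (MiniTask,), {
                "app": self,                # link the task back to this app
                "name": name,
                "run": staticmethod(fun),   # the actual function to run
                "__doc__": fun.__doc__,
            })
            self.tasks[name] = task_cls()
        return self.tasks[name]


app = MiniApp("tasks")


@app.task
def add(x, y):
    """Add two numbers."""
    return x + y


print(add.name)        # module-qualified name ending in ".add"
print(add(4, 5))       # 9
print(add.app is app)  # True
```

After decoration, `add` is no longer a plain function but a registered object that knows which app it belongs to, which is what later lets a `.delay()`-style method find the broker configuration.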
**Sending Tasks (`app/base.py`)** Calling `.delay()` or `.apply_async()` eventually uses `app.send_task`: ```python # Simplified from celery/app/base.py def send_task(self, name, args=None, kwargs=None, task_id=None, producer=None, connection=None, router=None, **options): # ... lots of logic to prepare options, task_id, routing ... # Get the routing info (exchange, routing_key, queue) # Uses app.conf for defaults if not specified options = self.amqp.router.route(options, name, args, kwargs) # Create the message body message = self.amqp.create_task_message( task_id or uuid(), # Generate task ID if needed name, args, kwargs, # Task details # ... other arguments like countdown, eta, expires ... ) # Get a producer (handles connection/channel to broker) # Uses the app's producer pool (app.producer_pool) with self.producer_or_acquire(producer) as P: # Tell the backend we're about to send (if tracking results) if not options.get('ignore_result', False): self.backend.on_task_call(P, task_id) # Actually send the message via the producer self.amqp.send_task_message(P, name, message, **options) # Create the AsyncResult object to return to the caller result = self.AsyncResult(task_id) # ... set result properties ... return result ``` This highlights how `send_task` relies on the `app` (via `self`) to: * Access configuration (`self.conf`). * Use the AMQP utilities (`self.amqp`) for routing and message creation. * Access the result backend (`self.backend`). * Get a connection/producer from the pool (`self.producer_or_acquire`). * Create the `AsyncResult` using the app's result class (`self.AsyncResult`). ## Conclusion You've learned that the `Celery App` is the essential starting point for any Celery project. * It acts as the central **headquarters** or **brain**. * You create it using `app = Celery(...)`, providing at least a name and a `broker` URL. * It holds **configuration** (like broker/backend URLs). * It **registers tasks** defined using the `@app.task` decorator. 
* It enables tasks to be **sent** to the broker using methods like `.delay()`. The app ties everything together. But how do you manage all the different settings Celery offers, beyond just the `broker` and `backend`? In the next chapter, we'll dive deeper into how to configure your Celery app effectively. **Next:** [Chapter 2: Configuration](02_configuration.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Celery/02_configuration.md ================================================ --- layout: default title: "Configuration" parent: "Celery" nav_order: 2 --- # Chapter 2: Configuration - Telling Celery How to Work In [Chapter 1: The Celery App](01_celery_app.md), we created our first `Celery` app instance. We gave it a name and told it where our message broker and result backend were located using the `broker` and `backend` arguments: ```python # From Chapter 1 from celery import Celery app = Celery('tasks', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0') ``` This worked, but what if we want to change settings later, or manage many different settings? Passing everything directly when creating the `app` can become messy. ## What Problem Does Configuration Solve? Think of Celery as a busy workshop with different stations (workers, schedulers) and tools (message brokers, result storage). **Configuration** is the central instruction manual or settings panel for this entire workshop. It tells Celery things like: * **Where is the message broker?** (The post office for tasks) * **Where should results be stored?** (The filing cabinet for completed work) * **How should tasks be handled?** (e.g., What format should the messages use? Are there any speed limits for certain tasks?) * **How should the workers behave?** (e.g., How many tasks can they work on at once?) 
* **How should scheduled tasks run?** (e.g., What timezone should be used?) Without configuration, Celery wouldn't know how to connect to your broker, where to put results, or how to manage the workflow. Configuration allows you to customize Celery to fit your specific needs. ## Key Configuration Concepts While Celery has many settings, here are some fundamental ones you'll encounter often: 1. **`broker_url`**: The address of your message broker (like Redis or RabbitMQ). This is essential for sending and receiving task messages. We'll learn more about brokers in [Chapter 4: Broker Connection (AMQP)](04_broker_connection__amqp_.md). 2. **`result_backend`**: The address of your result store. This is needed if you want to keep track of task status or retrieve return values. We cover this in [Chapter 6: Result Backend](06_result_backend.md). 3. **`include`**: A list of module names that the Celery worker should import when it starts. This is often where your task definitions live (like the `add` task from Chapter 1). 4. **`task_serializer`**: Defines the format used to package task messages before sending them to the broker (e.g., 'json', 'pickle'). 'json' is a safe and common default. 5. **`timezone`**: Sets the timezone Celery uses, which is important for scheduled tasks managed by [Chapter 7: Beat (Scheduler)](07_beat__scheduler_.md). ## How to Configure Your Celery App Celery is flexible and offers several ways to set its configuration. **Method 1: Directly on the App Object (After Creation)** You can update the configuration *after* creating the `Celery` app instance using the `app.conf.update()` method. This is handy for simple adjustments or quick tests. 
```python
# celery_app.py
from celery import Celery

# Create the app (maybe with initial settings)
app = Celery('tasks', broker='redis://localhost:6379/0')

# Update configuration afterwards
app.conf.update(
    result_backend='redis://localhost:6379/1',  # Use database 1 for results
    task_serializer='json',
    result_serializer='json',
    accept_content=['json'],  # Only accept json formatted tasks
    timezone='Europe/Oslo',
    enable_utc=True,  # Use UTC timezone internally
    # Add task modules to import when worker starts
    include=['my_tasks'],  # Assumes you have a file my_tasks.py with tasks
)

print(f"Broker URL set to: {app.conf.broker_url}")
print(f"Result backend set to: {app.conf.result_backend}")
print(f"Timezone set to: {app.conf.timezone}")
```

**Explanation:**

* We create the `app` like before, potentially setting some initial config like the `broker`.
* `app.conf.update(...)`: We pass the settings to this method as keyword arguments (a dictionary also works). The names are Celery setting names (like `result_backend`, `timezone`), and the values are what we want to set them to.
* `app.conf` is the central configuration object attached to your `app` instance.

**Method 2: Dedicated Configuration Module (Recommended)**

For most projects, especially larger ones, it's cleaner to keep your Celery settings in a separate Python file (e.g., `celeryconfig.py`).

1. **Create `celeryconfig.py`:**

    ```python
    # celeryconfig.py

    # Broker settings
    broker_url = 'redis://localhost:6379/0'

    # Result backend settings
    result_backend = 'redis://localhost:6379/1'

    # Task settings
    task_serializer = 'json'
    result_serializer = 'json'
    accept_content = ['json']

    # Timezone settings
    timezone = 'America/New_York'
    enable_utc = True  # Recommended

    # List of modules to import when the Celery worker starts.
    imports = ('proj.tasks',)  # Example: Assuming tasks are in proj/tasks.py
    ```

    **Explanation:**

    * This is just a standard Python file.
    * We define variables whose names match the Celery configuration settings (e.g., `broker_url`, `timezone`).
      Celery expects these specific names.

2. **Load the configuration in your app file (`celery_app.py`):**

    ```python
    # celery_app.py
    from celery import Celery

    # Create the app instance (no need to pass broker/backend here now)
    app = Celery('tasks')

    # Load configuration from the 'celeryconfig' module
    # Assumes celeryconfig.py is in the same directory or Python path
    app.config_from_object('celeryconfig')

    print(f"Loaded Broker URL from config file: {app.conf.broker_url}")
    print(f"Loaded Timezone from config file: {app.conf.timezone}")

    # You might still define tasks in this file or in the modules listed
    # in celeryconfig.imports
    @app.task
    def multiply(x, y):
        return x * y
    ```

    **Explanation:**

    * `app = Celery('tasks')`: We create the app instance, but we don't need to specify the broker or backend here because they will be loaded from the file.
    * `app.config_from_object('celeryconfig')`: This is the key line. It tells Celery to:
        * Find a module named `celeryconfig`.
        * Read the setting variables defined in that module (the lowercase names used above, such as `broker_url` and `timezone`).
        * Use those variables to configure the `app`.

This approach keeps your settings organized and separate from your application logic.

**Method 3: Environment Variables**

Celery settings can also be controlled via environment variables. This is very useful for deployments (e.g., using Docker) where you might want to change the broker address without changing code.

Environment variable names typically start with the `CELERY_` prefix followed by the uppercase setting name, as in `CELERY_BROKER_URL`.

For example, you could set the broker URL in your terminal before running your app or worker:

```bash
# In your terminal (Linux/macOS)
export CELERY_BROKER_URL='amqp://guest:guest@localhost:5672//'
export CELERY_RESULT_BACKEND='redis://localhost:6379/2'

# Now run your Python script or Celery worker
python your_script.py
# or
# celery -A your_app_module worker --loglevel=info
```

Celery automatically picks up these environment variables.
They often take precedence over settings defined in a configuration file or directly on the app, making them ideal for overriding settings in different environments (development, staging, production). *Note: The exact precedence order can sometimes depend on how and when configuration is loaded, but environment variables are generally a high-priority source.* ## How It Works Internally (Simplified View) 1. **Loading:** When you create a `Celery` app or call `app.config_from_object()`, Celery reads the settings from the specified source (arguments, object/module, environment variables). 2. **Storing:** These settings are stored in a dictionary-like object accessible via `app.conf`. Celery uses a default set of values initially, which are then updated or overridden by your configuration. 3. **Accessing:** When a Celery component needs a setting (e.g., the worker needs the `broker_url` to connect, or a task needs the `task_serializer`), it simply looks up the required key in the `app.conf` object. ```mermaid sequenceDiagram participant ClientCode as Your App Setup (e.g., celery_app.py) participant CeleryApp as app = Celery(...) participant ConfigSource as celeryconfig.py / Env Vars participant Worker as Celery Worker Process participant Broker as Message Broker (e.g., Redis) ClientCode->>CeleryApp: Create instance ClientCode->>CeleryApp: app.config_from_object('celeryconfig') CeleryApp->>ConfigSource: Read settings (broker_url, etc.) ConfigSource-->>CeleryApp: Return settings values Note over CeleryApp: Stores settings in app.conf Worker->>CeleryApp: Start worker for 'app' Worker->>CeleryApp: Access app.conf.broker_url CeleryApp-->>Worker: Return 'redis://localhost:6379/0' Worker->>Broker: Connect using 'redis://localhost:6379/0' ``` This diagram shows the app loading configuration first, and then the worker using that stored configuration (`app.conf`) to perform its duties, like connecting to the broker. 
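
The load, store, and access steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Celery's implementation: `DEFAULTS`, `settings_from_object`, and `fake_celeryconfig` are hypothetical names, and `collections.ChainMap` is used here simply as a convenient way to layer user settings over defaults.

```python
# Toy sketch of "defaults overridden by loaded settings" (NOT Celery's real code).
from collections import ChainMap
from types import SimpleNamespace

# Hypothetical defaults; Celery's real defaults live in celery.app.defaults.
DEFAULTS = {
    "broker_url": None,
    "task_serializer": "json",
    "timezone": "UTC",
}


def settings_from_object(obj):
    """Collect the public, non-callable attributes of obj into a dict."""
    return {
        key: value
        for key, value in vars(obj).items()
        if not key.startswith("_") and not callable(value)
    }


# Stand-in for an imported celeryconfig module (hypothetical values).
fake_celeryconfig = SimpleNamespace(
    broker_url="redis://localhost:6379/0",
    timezone="America/New_York",
)

# Lookups check the loaded settings first, then fall back to the defaults.
conf = ChainMap(settings_from_object(fake_celeryconfig), DEFAULTS)

print(conf["broker_url"])       # redis://localhost:6379/0 (from the config object)
print(conf["timezone"])         # America/New_York (overrides the default)
print(conf["task_serializer"])  # json (falls back to the default)
```

Any component holding a reference to `conf` reads settings by key and never needs to know which layer supplied the value, which mirrors how components consult `app.conf`.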
## Code Dive: Where Configuration Lives * **`app.conf`:** This is the primary interface you interact with. It's an instance of a special dictionary-like class (`celery.app.utils.Settings`) that handles loading defaults, converting keys (Celery has changed setting names over time), and providing convenient access. You saw this in the direct update example: `app.conf.update(...)`. * **Loading Logic (`config_from_object`)**: Methods like `app.config_from_object` typically delegate to the app's "loader" (`app.loader`). The loader (e.g., `celery.loaders.base.BaseLoader` or `celery.loaders.app.AppLoader`) handles the actual importing of the configuration module and extracting the settings. See `loaders/base.py` for the `config_from_object` method definition. * **Default Settings**: Celery has a built-in set of default values for all its settings. These are defined in `celery.app.defaults`. Your configuration overrides these defaults. See `app/defaults.py`. * **Accessing Settings**: Throughout the Celery codebase, different components access the configuration via `app.conf`. For instance, when sending a task (`app/base.py:send_task`), the code looks up `app.conf.broker_url` (or related settings) to know where and how to send the message. ```python # Simplified concept from loaders/base.py class BaseLoader: # ... def config_from_object(self, obj, silent=False): if isinstance(obj, str): # Import the module (e.g., 'celeryconfig') obj = self._smart_import(obj, imp=self.import_from_cwd) # ... error handling ... # Store the configuration (simplified - actual process merges) self._conf = force_mapping(obj) # Treat obj like a dictionary # ... return True # Simplified concept from app/base.py (where settings are used) class Celery: # ... def send_task(self, name, args=None, kwargs=None, **options): # ... other setup ... 
        # Access configuration to know where the broker is
        broker_connection_url = self.conf.broker_url  # Reads from app.conf

        # Use the broker URL to get a connection/producer
        # (no producer is passed in, so one is acquired from the app's pool)
        with self.producer_or_acquire() as P:
            # ... create message ...
            # Send message using the connection derived from broker_url
            self.amqp.send_task_message(P, name, message, **options)
        # ... return result object ...
```

This illustrates the core idea: load configuration into `app.conf`, then components read from `app.conf` when they need instructions.

## Conclusion

Configuration is the backbone of Celery's flexibility. You've learned:

* **Why it's needed:** To tell Celery *how* to operate (broker, backend, task settings).
* **What can be configured:** Broker/backend URLs, serializers, timezones, task imports, and much more.
* **How to configure:**
    * Directly via `app.conf.update()`.
    * Using a dedicated module (`celeryconfig.py`) with `app.config_from_object()`. (Recommended)
    * Using environment variables (great for deployment).
* **How it works:** Settings are loaded into `app.conf` and accessed by Celery components as needed.

With your Celery app configured, you're ready to define the actual work you want Celery to do. That's where Tasks come in!

**Next:** [Chapter 3: Task](03_task.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Celery/03_task.md
================================================
---
layout: default
title: "Task"
parent: "Celery"
nav_order: 3
---

# Chapter 3: Task - The Job Description

In [Chapter 1: The Celery App](01_celery_app.md), we set up our Celery headquarters, and in [Chapter 2: Configuration](02_configuration.md), we learned how to give it instructions. Now, we need to define the *actual work* we want Celery to do. This is where **Tasks** come in.

## What Problem Does a Task Solve?
Imagine you have a specific job that needs doing, like "Resize this image to thumbnail size" or "Send a welcome email to this new user." In Celery, each of these specific jobs is represented by a **Task**. A Task is like a **job description** or a **recipe**. It contains the exact steps (the code) needed to complete a specific piece of work. You write this recipe once as a Python function, and then you can tell Celery to follow that recipe whenever you need that job done, potentially many times with different inputs (like resizing different images or sending emails to different users). The key benefit is that you don't run the recipe immediately yourself. You hand the recipe (the Task) and the ingredients (the arguments, like the image file or the user's email) over to Celery. Celery then finds an available helper (a [Worker](05_worker.md)) who knows how to follow that specific recipe and lets them do the work in the background. This keeps your main application free to do other things. ## Defining Your First Task Defining a task in Celery is surprisingly simple. You just take a regular Python function and "decorate" it using `@app.task`. Remember our `app` object from [Chapter 1](01_celery_app.md)? We use its `task` decorator. Let's create a file, perhaps named `tasks.py`, to hold our task definitions: ```python # tasks.py import time from celery_app import app # Import the app instance we created @app.task def add(x, y): """A simple task that adds two numbers.""" print(f"Task 'add' starting with ({x}, {y})") # Simulate some work taking time time.sleep(5) result = x + y print(f"Task 'add' finished with result: {result}") return result @app.task def send_welcome_email(user_id): """A task simulating sending a welcome email.""" print(f"Task 'send_welcome_email' starting for user {user_id}") # Simulate email sending process time.sleep(3) print(f"Welcome email supposedly sent to user {user_id}") return f"Email sent to {user_id}" # You can have many tasks in one file! 
``` **Explanation:** 1. **`from celery_app import app`**: We import the `Celery` app instance we configured earlier. This instance holds the knowledge about our broker and backend. 2. **`@app.task`**: This is the magic decorator! When Celery sees this above a function (`add` or `send_welcome_email`), it says, "Ah! This isn't just a regular function; it's a job description that my workers need to know about." 3. **The Function (`add`, `send_welcome_email`)**: This is the actual Python code that performs the work. It's the core of the task – the steps in the recipe. It can take arguments (like `x`, `y`, or `user_id`) and can return a value. 4. **Registration**: The `@app.task` decorator automatically *registers* this function with our Celery `app`. Now, `app` knows about a task named `tasks.add` and another named `tasks.send_welcome_email` (Celery creates the name from `module_name.function_name`). Workers connected to this `app` will be able to find and execute this code when requested. *Self-Host Note:* If you are running this code, make sure you have a `celery_app.py` file containing your Celery app instance as shown in previous chapters, and that the `tasks.py` file can import `app` from it. ## Sending a Task for Execution Okay, we've written our recipes (`add` and `send_welcome_email`). How do we tell Celery, "Please run the `add` recipe with the numbers 5 and 7"? We **don't call the function directly** like `add(5, 7)`. If we did that, it would just run immediately in our current program, which defeats the purpose of using Celery! Instead, we use special methods on the task object itself, most commonly `.delay()` or `.apply_async()`. Let's try this in a separate Python script or an interactive Python session: ```python # run_tasks.py from tasks import add, send_welcome_email print("Let's send some tasks!") # --- Using .delay() --- # Tell Celery to run add(5, 7) in the background result_promise_add = add.delay(5, 7) print(f"Sent task add(5, 7). 
Task ID: {result_promise_add.id}") # Tell Celery to run send_welcome_email(123) in the background result_promise_email = send_welcome_email.delay(123) print(f"Sent task send_welcome_email(123). Task ID: {result_promise_email.id}") # --- Using .apply_async() --- # Does the same thing as .delay() but allows more options result_promise_add_later = add.apply_async(args=(10, 20), countdown=10) # Run after 10s print(f"Sent task add(10, 20) to run in 10s. Task ID: {result_promise_add_later.id}") print("Tasks have been sent to the broker!") print("A Celery worker needs to be running to pick them up.") ``` **Explanation:** 1. **`from tasks import add, send_welcome_email`**: We import our *task functions*. Because they were decorated with `@app.task`, they are now special Celery Task objects. 2. **`add.delay(5, 7)`**: This is the simplest way to send a task. * It *doesn't* run `add(5, 7)` right now. * It takes the arguments `(5, 7)`. * It packages them up into a **message** along with the task's name (`tasks.add`). * It sends this message to the **message broker** (like Redis or RabbitMQ) that we configured in our `celery_app.py`. Think of it like dropping a request slip into a mailbox. 3. **`send_welcome_email.delay(123)`**: Same idea, but for our email task. A message with `tasks.send_welcome_email` and the argument `123` is sent to the broker. 4. **`add.apply_async(args=(10, 20), countdown=10)`**: This is a more powerful way to send tasks. * It does the same fundamental thing: sends a message to the broker. * It allows for more options, like `args` (positional arguments as a tuple), `kwargs` (keyword arguments as a dict), `countdown` (delay execution by seconds), `eta` (run at a specific future time), and many others. * `.delay(*args, **kwargs)` is just a convenient shortcut for `.apply_async(args=args, kwargs=kwargs)`. 5. **`result_promise_... = ...`**: Both `.delay()` and `apply_async()` return an `AsyncResult` object immediately. 
This is *not* the actual result of the task (like `12` for `add(5, 7)`). It's more like a receipt or a tracking number (notice the `.id` attribute). You can use this object later to check if the task finished and what its result was, but only if you've set up a [Result Backend](06_result_backend.md) (Chapter 6). 6. **The Worker**: Sending the task only puts the message on the queue. A separate process, the Celery [Worker](05_worker.md) (Chapter 5), needs to be running. The worker constantly watches the queue, picks up messages, finds the corresponding task function (using the name like `tasks.add`), and executes it with the provided arguments. ## How It Works Internally (Simplified) Let's trace the journey of defining and sending our `add` task: 1. **Definition (`@app.task` in `tasks.py`)**: * Python defines the `add` function. * The `@app.task` decorator sees this function. * It tells the `Celery` instance (`app`) about this function, registering it under the name `tasks.add` in an internal dictionary (`app.tasks`). The `app` instance knows the broker/backend settings. 2. **Sending (`add.delay(5, 7)` in `run_tasks.py`)**: * You call `.delay()` on the `add` task object. * `.delay()` (or `.apply_async()`) internally uses the `app` the task is bound to. * It asks the `app` for the configured broker URL. * It creates a message containing: * Task Name: `tasks.add` * Arguments: `(5, 7)` * Other options (like a unique Task ID). * It connects to the **Broker** (e.g., Redis) using the broker URL. * It sends the message to a specific queue (usually named 'celery' by default) on the broker. * It returns an `AsyncResult` object referencing the Task ID. 3. **Waiting**: The message sits in the queue on the broker, waiting. 4. **Execution (by a [Worker](05_worker.md))**: * A separate Celery Worker process is running, connected to the same broker and `app`. * The Worker fetches the message from the queue. * It reads the task name: `tasks.add`. 
* It looks up `tasks.add` in its copy of the `app.tasks` registry to find the actual `add` function code. * It calls the `add` function with the arguments from the message: `add(5, 7)`. * The function runs (prints logs, sleeps, calculates `12`). * If a [Result Backend](06_result_backend.md) is configured, the Worker takes the return value (`12`) and stores it in the backend, associated with the Task ID. * The Worker acknowledges the message to the broker, removing it from the queue. ```mermaid sequenceDiagram participant Client as Your Code (run_tasks.py) participant TaskDef as @app.task def add() participant App as Celery App Instance participant Broker as Message Broker (e.g., Redis) participant Worker as Celery Worker (separate process) Note over TaskDef, App: 1. @app.task registers 'add' function with App's task registry Client->>TaskDef: 2. Call add.delay(5, 7) TaskDef->>App: 3. Get broker config App-->>TaskDef: Broker URL TaskDef->>Broker: 4. Send message ('tasks.add', (5, 7), task_id, ...) Broker-->>TaskDef: Ack (Message Queued) TaskDef-->>Client: 5. Return AsyncResult(task_id) Worker->>Broker: 6. Fetch next message Broker-->>Worker: Message ('tasks.add', (5, 7), task_id) Worker->>App: 7. Lookup 'tasks.add' in registry App-->>Worker: add function code Worker->>Worker: 8. Execute add(5, 7) -> returns 12 Note over Worker: (Optionally store result in Backend) Worker->>Broker: 9. Acknowledge message completion ``` ## Code Dive: Task Creation and Sending * **Task Definition (`@app.task`)**: This decorator is defined in `celery/app/base.py` within the `Celery` class method `task`. It ultimately calls `_task_from_fun`. ```python # Simplified from celery/app/base.py class Celery: # ... def task(self, *args, **opts): # ... handles decorator arguments ... 
        def _create_task_cls(fun):
            # Returns a Task instance or a Proxy that creates one later
            ret = self._task_from_fun(fun, **opts)
            return ret
        return _create_task_cls

    def _task_from_fun(self, fun, name=None, base=None, bind=False, **options):
        # Generate name like 'tasks.add' if not given
        name = name or self.gen_task_name(fun.__name__, fun.__module__)
        base = base or self.Task  # The base Task class (from celery.app.task)

        if name not in self._tasks:  # If not already registered...
            # Dynamically create a Task class wrapping the function
            task = type(fun.__name__, (base,), {
                'app': self,  # Link task back to this app instance!
                'name': name,
                'run': staticmethod(fun),  # The actual function to run
                '__doc__': fun.__doc__,
                '__module__': fun.__module__,
                # ... other options ...
            })()  # Instantiate the new Task class
            self._tasks[task.name] = task  # Add to app's registry!
            task.bind(self)  # Perform binding steps
        else:
            task = self._tasks[name]  # Task already exists
        return task
```

This shows how the decorator essentially creates a specialized object (an instance of a class derived from `celery.app.task.Task`) that wraps your original function and registers it with the `app` under a specific name.

* **Task Sending (`.delay`)**: The `.delay()` method is defined on the `Task` class itself in `celery/app/task.py`. It's a simple shortcut.

```python
# Simplified from celery/app/task.py
class Task:
    # ...
    def delay(self, *args, **kwargs):
        """Shortcut for apply_async(args, kwargs)"""
        return self.apply_async(args, kwargs)

    # Simplified signature: the other named parameters are omitted here
    def apply_async(self, args=None, kwargs=None, **options):
        # ... argument checking, option processing ...

        # Get the app associated with this task instance
        app = self._get_app()

        # If always_eager is set, run locally instead of sending
        if app.conf.task_always_eager:
            return self.apply(args, kwargs, **options)
# Runs inline # The main path: tell the app to send the task message return app.send_task( self.name, args, kwargs, task_type=self, **options # Includes things like countdown, eta, queue etc. ) ``` You can see how `.delay` just calls `.apply_async`, which then (usually) delegates the actual message sending to the `app.send_task` method we saw briefly in [Chapter 1](01_celery_app.md). The `app` uses its configuration to know *how* and *where* to send the message. ## Conclusion You've learned the core concept of a Celery **Task**: * It represents a single, well-defined **unit of work** or **job description**. * You define a task by decorating a normal Python function with `@app.task`. This **registers** the task with your Celery application. * You **send** a task request (not run it directly) using `.delay()` or `.apply_async()`. * Sending a task puts a **message** onto a queue managed by a **message broker**. * A separate **Worker** process picks up the message and executes the corresponding task function. Tasks are the fundamental building blocks of work in Celery. Now that you know how to define a task and request its execution, let's look more closely at the crucial component that handles passing these requests around: the message broker. **Next:** [Chapter 4: Broker Connection (AMQP)](04_broker_connection__amqp_.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Celery/04_broker_connection__amqp_.md ================================================ --- layout: default title: "Broker Connection (AMQP)" parent: "Celery" nav_order: 4 --- # Chapter 4: Broker Connection (AMQP) - Celery's Postal Service In [Chapter 3: Task](03_task.md), we learned how to define "job descriptions" (Tasks) like `add(x, y)` and how to request them using `.delay()`. 
But when you call `add.delay(2, 2)`, how does that request actually *get* to a worker process that can perform the addition? It doesn't just magically appear!

This is where the **Broker Connection** comes in. Think of it as Celery's built-in postal service.

## What Problem Does the Broker Connection Solve?

Imagine you want to send a letter (a task request) to a friend (a worker) who lives in another city. You can't just shout the message out your window and hope they hear it. You need:

1.  A **Post Office** (the Message Broker, like RabbitMQ or Redis) that handles mail.
2.  A way to **talk to the Post Office** (the Broker Connection) to drop off your letter or pick up mail addressed to you.

The Broker Connection is that crucial link between your application (where you call `.delay()`) or your Celery worker and the message broker system. It manages sending messages *to* the broker and receiving messages *from* the broker reliably.

Without this connection, your task requests would never leave your application, and your workers would never know there's work waiting for them.

## Key Concepts: Post Office & Rules

Let's break down the pieces:

1.  **The Message Broker (The Post Office):** This is a separate piece of software that acts as a central hub for messages. Common choices are RabbitMQ and Redis. You tell Celery its address using the `broker_url` setting in your [Configuration](02_configuration.md).

    ```python
    # From Chapter 2 - celeryconfig.py
    broker_url = 'amqp://guest:guest@localhost:5672//' # Example for RabbitMQ
    # Or maybe: broker_url = 'redis://localhost:6379/0' # Example for Redis
    ```

2.  **The Connection (Talking to the Staff):** This is the active communication channel established between your Python code (either your main app or a worker) and the broker. It's like having an open phone line to the post office. Celery, using a library called `kombu`, handles creating and managing these connections based on the `broker_url`.

3.  **AMQP (The Postal Rules):** AMQP stands for **Advanced Message Queuing Protocol**. Think of it as a specific set of rules and procedures for how post offices should operate – how letters should be addressed, sorted, delivered, and confirmed.
    *   RabbitMQ is a broker that speaks AMQP natively.
    *   Other brokers, like Redis, use different protocols (their own set of rules).
    *   **Why mention AMQP?** It's a very common and powerful protocol for message queuing, and the principles behind it (exchanges, queues, routing) are fundamental to how Celery routes tasks, even when using other brokers. Celery's internal component for handling this communication is often referred to as `app.amqp` (found in `app/amqp.py`), even though the underlying library (`kombu`) supports multiple protocols. So, we focus on the *concept* of managing the broker connection, often using AMQP terminology as a reference point.

4.  **Producer (Sending Mail):** When your application calls `add.delay(2, 2)`, it acts as a *producer*. It uses its broker connection to send a message ("Please run 'add' with arguments (2, 2)") to the broker.

5.  **Consumer (Receiving Mail):** A Celery [Worker](05_worker.md) acts as a *consumer*. It uses its *own* broker connection to constantly check a specific mailbox (queue) at the broker for new messages. When it finds one, it takes it, performs the task, and tells the broker it's done.

## How Sending a Task Uses the Connection

Let's revisit sending a task from [Chapter 3: Task](03_task.md):

```python
# run_tasks.py (simplified)
from tasks import add
from celery_app import app # Assume app is configured with a broker_url

# 1. You call .delay()
print("Sending task...")
result_promise = add.delay(2, 2)

# Behind the scenes:
# a. Celery looks at the 'add' task, finds its associated 'app'.
# b. It asks 'app' for the broker_url from its configuration.
# c. It uses the app.amqp component (powered by Kombu) to get a connection
#    to the broker specified by the URL (e.g., 'amqp://localhost...').
# d. It packages the task name 'tasks.add' and args (2, 2) into a message.
# e. It uses the connection to 'publish' (send) the message to the broker.

print(f"Task sent! ID: {result_promise.id}")
```

The `add.delay(2, 2)` call triggers this whole process. It needs the configured `broker_url` to know *which* post office to connect to, and the broker connection handles the actual sending of the "letter" (task message).

Similarly, a running Celery [Worker](05_worker.md) establishes its own connection to the *same* broker. It uses this connection to *listen* for incoming messages on the queues it's assigned to.

## How It Works Internally (Simplified)

Celery uses a powerful library called **Kombu** to handle the low-level details of connecting and talking to different types of brokers (RabbitMQ, Redis, etc.). The `app.amqp` object in Celery acts as a high-level interface to Kombu's features.

1.  **Configuration:** The `broker_url` tells Kombu where and how to connect.
2.  **Connection Pool:** To be efficient, Celery (via Kombu) often maintains a *pool* of connections. When you send a task, it might grab an existing, idle connection from the pool instead of creating a new one every time. This is faster. You can see this managed by `app.producer_pool` in `app/base.py`.
3.  **Producer:** When `task.delay()` is called, it ultimately uses a `kombu.Producer` object. This object represents the ability to *send* messages. It's tied to a specific connection and channel.
4.  **Publishing:** The producer's `publish()` method is called. This takes the task message (already serialized into a format like JSON), specifies the destination (exchange and routing key - think of these like the address and sorting code on an envelope), and sends it over the connection to the broker.
5.  **Consumer:** A Worker uses a `kombu.Consumer` object.
    This object is set up to listen on specific queues via its connection. When a message arrives in one of those queues, the broker pushes it to the consumer over the connection, and the consumer triggers the appropriate Celery task execution logic.

```mermaid
sequenceDiagram
    participant Client as Your App Code
    participant Task as add.delay()
    participant App as Celery App
    participant AppAMQP as app.amqp (Kombu Interface)
    participant Broker as RabbitMQ / Redis

    Client->>Task: Call add.delay(2, 2)
    Task->>App: Get broker config (broker_url)
    App-->>Task: broker_url
    Task->>App: Ask to send task 'tasks.add'
    App->>AppAMQP: Send task message('tasks.add', (2, 2), ...)
    Note over AppAMQP: Gets connection/producer (maybe from pool)
    AppAMQP->>Broker: publish(message, routing_info) via Connection
    Broker-->>AppAMQP: Acknowledge message received
    AppAMQP-->>App: Message sent successfully
    App-->>Task: Return AsyncResult
    Task-->>Client: Return AsyncResult
```

This shows the flow: your code calls `.delay()`, Celery uses its configured connection details (`app.amqp` layer) to get a connection and producer, and then publishes the message to the broker.

## Code Dive: Sending a Message

Let's peek inside `app/amqp.py` where the `AMQP` class orchestrates sending. The `send_task_message` method (simplified below) is key.

```python
# Simplified from app/amqp.py within the AMQP class

# This function is configured internally and gets called by app.send_task
def _create_task_sender(self):
    # ... (lots of setup: getting defaults from config, signals) ...
    default_serializer = self.app.conf.task_serializer
    default_compressor = self.app.conf.task_compression

    def send_task_message(producer, name, message,
                          exchange=None, routing_key=None, queue=None,
                          serializer=None, compression=None, declare=None,
                          retry=None, retry_policy=None,
                          **properties):
        # ... (Determine exchange, routing_key, queue based on config/options) ...
        # ... (Prepare headers, properties, handle retries) ...

        headers, properties, body, sent_event = message # Unpack the prepared message tuple

        # The core action: Use the producer to publish the message!
        ret = producer.publish(
            body, # The actual task payload (args, kwargs, etc.)
            exchange=exchange,
            routing_key=routing_key,
            serializer=serializer or default_serializer, # e.g., 'json'
            compression=compression or default_compressor,
            retry=retry,
            retry_policy=retry_policy,
            declare=declare, # Maybe declare queues/exchanges if needed
            headers=headers,
            **properties # Other message properties (correlation_id, etc.)
        )

        # ... (Send signals like task_sent, publish events if configured) ...
        return ret
    return send_task_message
```

**Explanation:**

*   This function takes a `producer` object (which is linked to a broker connection via Kombu).
*   It figures out the final destination details (exchange, routing key).
*   It calls `producer.publish()`, passing the task body and all the necessary options (like serializer). This is the function that actually sends the data over the network connection to the broker.

The `Connection` objects themselves are managed by Kombu (see `kombu/connection.py`). Celery uses these objects via its `app.connection_for_write()` or `app.connection_for_read()` methods, which often pull from the connection pool (`kombu.pools`).

## Conclusion

The Broker Connection is Celery's vital communication link, its "postal service."

*   It connects your application and workers to the **Message Broker** (like RabbitMQ or Redis).
*   It uses the `broker_url` from your [Configuration](02_configuration.md) to know where to connect.
*   Protocols like **AMQP** define the "rules" for communication, although Celery's underlying library (Kombu) handles various protocols.
*   Your app **produces** task messages and sends them over the connection.
*   Workers **consume** task messages received over their connection.
*   Celery manages connections efficiently, often using **pools**.
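
The producer and consumer roles summarized above can be illustrated without a real broker. The following is a toy model built only on the standard library; it is not Celery's or Kombu's actual API. `ToyBroker`, `publish`, and `consume_one` are hypothetical names, and the one-queue-per-routing-key behavior is a deliberate simplification of a direct exchange.

```python
import json
import queue
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, for illustration)."""

    def __init__(self):
        # One FIFO queue per routing key, loosely mimicking a direct exchange.
        self.queues = defaultdict(queue.Queue)

    def publish(self, body, routing_key):
        # Messages travel in serialized form, never as live Python objects.
        self.queues[routing_key].put(json.dumps(body))

    def consume_one(self, queue_name):
        # Return the next decoded message, or None if the queue is empty.
        try:
            return json.loads(self.queues[queue_name].get_nowait())
        except queue.Empty:
            return None


def add(x, y):
    return x + y


# Registry mapping task names to functions, similar in spirit to a task registry.
TASKS = {"tasks.add": add}

broker = ToyBroker()

# Producer side: roughly what sending add(2, 2) boils down to.
broker.publish({"task": "tasks.add", "args": [2, 2], "id": "t1"}, routing_key="celery")

# Consumer side: pull the message, look up the task by name, and run it.
message = broker.consume_one("celery")
result = TASKS[message["task"]](*message["args"])
print(message["id"], result)
```

The point of the sketch is that the producer and consumer never call each other directly: they only share the broker and an agreed message format, which is why they can live in different processes or on different machines.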

Understanding the broker connection helps clarify how tasks move from where they're requested to where they run. Now that we know how tasks are defined and sent across the wire, let's look at the entity that actually picks them up and does the work.

**Next:** [Chapter 5: Worker](05_worker.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Celery/05_worker.md
================================================
---
layout: default
title: "Worker"
parent: "Celery"
nav_order: 5
---

# Chapter 5: Worker - The Task Doer

In [Chapter 4: Broker Connection (AMQP)](04_broker_connection__amqp_.md), we learned how Celery uses a message broker, like a postal service, to send task messages. When you call `add.delay(2, 2)`, a message asking to run the `add` task with arguments `(2, 2)` gets dropped into a mailbox (the broker queue).

But who actually checks that mailbox, picks up the message, and performs the addition? That's the job of the **Celery Worker**.

## What Problem Does the Worker Solve?

Imagine our workshop analogy again. You've defined the blueprint for a job ([Task](03_task.md)) and you've dropped the work order into the central inbox ([Broker Connection (AMQP)](04_broker_connection__amqp_.md)). Now you need an actual employee or a machine to:

1.  Look in the inbox for new work orders.
2.  Pick up an order.
3.  Follow the instructions (run the task code).
4.  Maybe put the finished product (the result) somewhere specific.
5.  Mark the order as complete.

The **Celery Worker** is that employee or machine. It's a separate program (process) that you run, whose sole purpose is to execute the tasks you send to the broker. Without a worker running, your task messages would just sit in the queue forever, waiting for someone to process them.

## Starting Your First Worker

Running a worker is typically done from your command line or terminal.
You need to tell the worker where to find your [Celery App](01_celery_app.md) instance (which holds the configuration, including the broker address and the list of known tasks).

Assuming you have:

*   A file `celery_app.py` containing your `app = Celery(...)` instance.
*   A file `tasks.py` containing your task definitions (like `add` and `send_welcome_email`) decorated with `@app.task`.
*   Your message broker (e.g., Redis or RabbitMQ) running.

You can start a worker like this:

```bash
# In your terminal, in the same directory as celery_app.py and tasks.py
# Make sure your Python environment has celery and the broker driver installed
# (e.g., pip install celery redis)

celery -A celery_app worker --loglevel=info
```

**Explanation:**

*   `celery`: This is the main Celery command-line program.
*   `-A celery_app`: The `-A` flag (or `--app`) tells Celery where to find your `Celery` app instance. `celery_app` refers to the `celery_app.py` file (or module) and implies Celery should look for an instance named `app` inside it.
*   `worker`: This specifies that you want to run the worker component.
*   `--loglevel=info`: This sets the logging level. `info` is a good starting point, showing you when the worker connects, finds tasks, and executes them. Other levels include `debug` (more verbose), `warning`, `error`, and `critical`.

**What You'll See:**

When the worker starts successfully, you'll see a banner like this (details may vary):

```text
 -------------- celery@yourhostname v5.x.x (stars)
--- ***** -----
-- ******* ---- Linux-5.15.0...-generic-x86_64-with-... 2023-10-27 10:00:00
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         tasks:0x7f...
- ** ---------- .> transport:   redis://localhost:6379/0
- ** ---------- .> results:     redis://localhost:6379/0
- *** --- * --- .> concurrency: 8 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
  . tasks.add
  . tasks.send_welcome_email

[2023-10-27 10:00:01,000: INFO/MainProcess] Connected to redis://localhost:6379/0
[2023-10-27 10:00:01,050: INFO/MainProcess] mingle: searching for neighbors
[2023-10-27 10:00:02,100: INFO/MainProcess] mingle: all alone
[2023-10-27 10:00:02,150: INFO/MainProcess] celery@yourhostname ready.
```

**Key Parts of the Banner:**

*   `celery@yourhostname`: The unique name of this worker instance.
*   `transport`: The broker URL it connected to (from your app config).
*   `results`: The result backend URL (if configured).
*   `concurrency`: How many tasks this worker can potentially run at once (defaults to the number of CPU cores) and the execution pool type (`prefork` is common). We'll touch on this later.
*   `queues`: The specific "mailboxes" (queues) the worker is listening to. `celery` is the default queue name.
*   `[tasks]`: A list of all the tasks the worker discovered (like our `tasks.add` and `tasks.send_welcome_email`). If your tasks don't show up here, the worker won't be able to run them!

The final `celery@yourhostname ready.` message means the worker is connected and waiting for jobs!

## What the Worker Does

Now that the worker is running, let's trace what happens when you send a task (e.g., from `run_tasks.py` in [Chapter 3: Task](03_task.md)):

1.  **Waiting:** The worker is connected to the broker, listening on the `celery` queue.
2.  **Message Arrival:** Your `add.delay(5, 7)` call sends a message to the `celery` queue on the broker. The broker notifies the worker.
3.  **Receive & Decode:** The worker receives the raw message. It decodes it to find the task name (`tasks.add`), the arguments (`(5, 7)`), and other info (like a unique task ID).
4.  **Find Task Code:** The worker looks up the name `tasks.add` in its internal registry (populated when it started) to find the actual Python function `add` defined in `tasks.py`.
5.  **Execute:** The worker executes the function: `add(5, 7)`.
    *   You will see the `print` statements from your task function appear in the *worker's* terminal output:

        ```text
        [2023-10-27 10:05:00,100: INFO/ForkPoolWorker-1] Task tasks.add[some-task-id] received
        Task 'add' starting with (5, 7)
        Task 'add' finished with result: 12
        [2023-10-27 10:05:05,150: INFO/ForkPoolWorker-1] Task tasks.add[some-task-id] succeeded in 5.05s: 12
        ```

6.  **Store Result (Optional):** If a [Result Backend](06_result_backend.md) is configured, the worker takes the return value (`12`) and sends it to the backend, associating it with the task's unique ID.
7.  **Acknowledge:** The worker sends an "acknowledgement" (ack) back to the broker. This tells the broker, "I've taken responsibility for this message, you can delete it from the queue." By default, Celery sends the ack just before the task starts executing. If you enable late acknowledgement (`task_acks_late = True` in the configuration, or `acks_late=True` on the task), the ack is sent only after the task finishes, so if the worker crashes mid-execution the message stays on the queue for another worker to pick up.
8.  **Wait Again:** The worker goes back to waiting for the next message.

## Running Multiple Workers and Concurrency

*   **Multiple Workers:** You can start multiple worker processes by running the `celery worker` command again, perhaps on different machines or in different terminals on the same machine. They will all connect to the same broker and pull tasks from the queue, allowing you to process tasks in parallel and scale your application.
*   **Concurrency within a Worker:** A single worker process can often handle more than one task concurrently. Celery achieves this using *execution pools*.
    *   **Prefork (Default):** The worker starts several child *processes*. Each child process handles one task at a time. The `-c` (or `--concurrency`) flag controls the number of child processes (default is the number of CPU cores). This is good for CPU-bound tasks.
    *   **Eventlet/Gevent:** Uses *green threads* (lightweight concurrency managed by libraries like eventlet or gevent).
        A single worker process can handle potentially hundreds or thousands of tasks concurrently, especially if the tasks are I/O-bound (e.g., waiting for network requests). You select these using the `-P` flag: `celery -A celery_app worker -P eventlet -c 1000`. Requires installing the respective library (`pip install eventlet` or `pip install gevent`).
    *   **Solo:** Executes tasks one after another in the main worker process. Useful for debugging. `-P solo`.
    *   **Threads:** Uses regular OS threads. `-P threads`. Less common for Celery tasks due to Python's Global Interpreter Lock (GIL) limitations for CPU-bound tasks, but can be useful for I/O-bound tasks.

For beginners, sticking with the default **prefork** pool is usually fine. Just know that the worker can likely handle multiple tasks simultaneously.

## How It Works Internally (Simplified)

Let's visualize the worker's main job: processing a single task.

1.  **Startup:** The `celery worker` command starts the main worker process. It loads the `Celery App`, reads the configuration (`broker_url`, tasks to import, etc.).
2.  **Connect & Listen:** The worker establishes a connection to the message broker and tells it, "I'm ready to consume messages from the 'celery' queue."
3.  **Message Delivery:** The broker sees a message for the 'celery' queue (sent by `add.delay(5, 7)`) and delivers it to the connected worker.
4.  **Consumer Receives:** The worker's internal "Consumer" component receives the message.
5.  **Task Dispatch:** The Consumer decodes the message, identifies the task (`tasks.add`), and finds the arguments (`(5, 7)`). It then hands this off to the configured execution pool (e.g., prefork).
6.  **Pool Execution:** The pool (e.g., a child process in the prefork pool) gets the task function and arguments and executes `add(5, 7)`.
7.  **Result Return:** The pool process finishes execution and returns the result (`12`) back to the main worker process.
8.  **Result Handling (Optional):** The main worker process, if a [Result Backend](06_result_backend.md) is configured, sends the result (`12`) and task ID to the backend store.
9.  **Acknowledgement:** The main worker process sends an "ack" message back to the broker, and the broker then deletes the message. (This walkthrough shows the ack after completion, which is the behavior with `acks_late` enabled; by default the ack is sent just before the task starts executing.)

```mermaid
sequenceDiagram
    participant CLI as Terminal (celery worker)
    participant WorkerMain as Worker Main Process
    participant App as Celery App Instance
    participant Broker as Message Broker
    participant Pool as Execution Pool (e.g., Prefork Child)
    participant TaskCode as Your Task Function (add)

    CLI->>WorkerMain: Start celery -A celery_app worker
    WorkerMain->>App: Load App & Config (broker_url, tasks)
    WorkerMain->>Broker: Connect & Listen on 'celery' queue

    Broker-->>WorkerMain: Deliver Message ('tasks.add', (5, 7), task_id)
    WorkerMain->>WorkerMain: Decode Message
    WorkerMain->>Pool: Request Execute add(5, 7) with task_id
    Pool->>TaskCode: Run add(5, 7)
    TaskCode-->>Pool: Return 12
    Pool-->>WorkerMain: Result=12 for task_id
    Note over WorkerMain: (Optionally) Store 12 in Result Backend
    WorkerMain->>Broker: Acknowledge task_id is complete
```

## Code Dive: Where Worker Logic Lives

*   **Command Line Entry Point (`celery/bin/worker.py`):** This script handles parsing the command-line arguments (`-A`, `-l`, `-c`, `-P`, etc.) when you run `celery worker ...`. It ultimately creates and starts a `WorkController` instance. (See `worker()` function in the file).
*   **Main Worker Class (`celery/worker/worker.py`):** The `WorkController` class is the heart of the worker. It manages all the different components (like the pool, consumer, timer, etc.) using a system called "bootsteps". It handles the overall startup, shutdown, and coordination. (See `WorkController` class).
*   **Message Handling (`celery/worker/consumer/consumer.py`):** The `Consumer` class (specifically its `Blueprint` and steps like `Tasks` and `Evloop`) is responsible for the core loop of fetching messages from the broker via the connection, decoding them, and dispatching them to the execution pool using task strategies. (See `Consumer.create_task_handler`).
*   **Execution Pools (`celery/concurrency/`):** Modules like `prefork.py`, `solo.py`, `eventlet.py`, `gevent.py` implement the different concurrency models (`-P` flag). The `WorkController` selects and manages one of these pools.

A highly simplified conceptual view of the core message processing logic within the `Consumer`:

```python
# Conceptual loop inside the Consumer (highly simplified)

def message_handler(message):
    try:
        # 1. Decode message (task name, args, kwargs, id, etc.)
        task_name, args, kwargs, task_id = decode_message(message.body, message.headers)

        # 2. Find the registered task function
        task_func = app.tasks[task_name]

        # 3. Prepare execution request for the pool
        request = TaskRequest(task_id, task_name, task_func, args, kwargs)

        # 4. Send request to the pool for execution
        #    (Pool runs request.execute() which calls task_func(*args, **kwargs))
        pool.apply_async(request.execute, accept_callback=task_succeeded, ...)

    except Exception as e:
        # Handle errors (e.g., unknown task, decoding error)
        log_error(e)
        message.reject() # Tell broker it failed

def task_succeeded(task_id, retval):
    # Called by the pool when task finishes successfully
    # 5. Store result (optional)
    if app.backend:
        app.backend.store_result(task_id, retval, status='SUCCESS')

    # 6. Acknowledge message to broker
    message.ack()

# --- Setup ---
# Worker connects to broker and registers message_handler
# for incoming messages on the subscribed queue(s)
connection.consume(queue_name, callback=message_handler)

# Start the event loop to wait for messages
connection.drain_events()
```

This illustrates the fundamental cycle: receive -> decode -> find task -> execute via pool -> handle result -> acknowledge. The actual code involves much more detail regarding error handling, state management, different protocols, rate limiting, etc., managed through the bootstep system.

## Conclusion

You've now met the **Celery Worker**, the essential component that actually *runs* your tasks.

*   It's a **separate process** you start from the command line (`celery worker`).
*   It connects to the **broker** using the configuration from your **Celery App**.
*   It **listens** for task messages on queues.
*   It **executes** the corresponding task code when a message arrives.
*   It handles **concurrency** using execution pools (like prefork, eventlet, gevent).
*   It **acknowledges** messages to the broker so they can be removed from the queue (by default just before execution, or after completion when `acks_late` is enabled).

Without workers, Celery tasks would never get done. But what happens when a task finishes? What if it returns a value, like our `add` task returning `12`? How can your original application find out the result? That's where the Result Backend comes in.

**Next:** [Chapter 6: Result Backend](06_result_backend.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Celery/06_result_backend.md
================================================
---
layout: default
title: "Result Backend"
parent: "Celery"
nav_order: 6
---

# Chapter 6: Result Backend - Checking Your Task's Homework

In [Chapter 5: Worker](05_worker.md), we met the Celery Worker, the diligent entity that picks up task messages from the [Broker Connection (AMQP)](04_broker_connection__amqp_.md) and executes the code defined in our [Task](03_task.md).

But what happens after the worker finishes a task? What if the task was supposed to calculate something, like `add(2, 2)`? How do we, back in our main application, find out the answer (`4`)? Or even just know if the task finished successfully or failed?

This is where the **Result Backend** comes in. It's like a dedicated place to check the status and results of the homework assigned to the workers.

## What Problem Does the Result Backend Solve?

Imagine you give your Celery worker a math problem: "What is 123 + 456?". The worker goes away, calculates the answer (579), and... then what?

If you don't tell the worker *where* to put the answer, it just disappears! You, back in your main program, have no idea if the worker finished, if it got the right answer, or if it encountered an error.

The **Result Backend** solves this by providing a storage location (like a database, a cache like Redis, or even via the message broker itself) where the worker can:

1.  Record the final **state** of the task (e.g., `SUCCESS`, `FAILURE`).
2.  Store the task's **return value** (e.g., `579`) if it succeeded.
3.  Store the **error** information (e.g., `TypeError: unsupported operand type(s)...`) if it failed.

Later, your main application can query this Result Backend using the task's unique ID to retrieve this information.
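
To make the three stored items concrete, here is a minimal sketch of the idea using a plain dictionary. It is a toy model, not Celery's backend API: `ToyResultBackend`, `run_task`, and the `toy-task-meta-` key prefix are hypothetical names chosen for illustration, and storing the exception as its `repr` is a simplification.

```python
import json


class ToyResultBackend:
    """Dict-backed stand-in for a result backend (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}

    def store_result(self, task_id, state, result=None):
        # Results are keyed by task ID and kept in serialized form.
        meta = {"task_id": task_id, "status": state, "result": result}
        self._store[f"toy-task-meta-{task_id}"] = json.dumps(meta)

    def get_task_meta(self, task_id):
        # In this toy model, an ID with no stored entry is reported as PENDING.
        raw = self._store.get(f"toy-task-meta-{task_id}")
        if raw is None:
            return {"task_id": task_id, "status": "PENDING", "result": None}
        return json.loads(raw)


def run_task(backend, task_id, func, *args):
    """Worker side: run the function and record either the value or the error."""
    try:
        value = func(*args)
    except Exception as exc:
        backend.store_result(task_id, "FAILURE", result=repr(exc))
    else:
        backend.store_result(task_id, "SUCCESS", result=value)


backend = ToyResultBackend()
run_task(backend, "t1", lambda x, y: x + y, 123, 456)    # succeeds
run_task(backend, "t2", lambda x, y: x + y, 123, "456")  # raises TypeError

# Application side: look up each outcome by task ID.
for task_id in ("t1", "t2", "t3"):
    meta = backend.get_task_meta(task_id)
    print(task_id, meta["status"], meta["result"])
```

Note that the application never talks to the worker here. Both sides only agree on the task ID and on where the outcome is stored.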

Think of it as a shared filing cabinet:

*   The **Worker** puts the completed homework (result and status) into a specific folder (identified by the task ID).
*   Your **Application** can later look inside that folder (using the task ID) to see the results.

## Key Concepts

1.  **Storage:** It's a place to store task results and states. This could be Redis, a relational database (like PostgreSQL or MySQL), MongoDB, RabbitMQ (using RPC), and others.
2.  **Task ID:** Each task execution gets a unique ID (remember the `result_promise_add.id` from Chapter 3?). This ID is the key used to store and retrieve the result from the backend.
3.  **State:** Besides the return value, the backend stores the task's current state (e.g., `PENDING`, `STARTED`, `SUCCESS`, `FAILURE`, `RETRY`, `REVOKED`).
4.  **Return Value / Exception:** If the task finishes successfully (`SUCCESS`), the backend stores the value the task function returned. If it fails (`FAILURE`), it stores details about the exception that occurred.
5.  **`AsyncResult` Object:** When you call `task.delay()` or `task.apply_async()`, Celery gives you back an `AsyncResult` object. This object holds the task's ID and provides methods to interact with the result backend (check status, get the result, etc.).

## How to Use a Result Backend

**1. Configure It!**

First, you need to tell your Celery app *where* the result backend is located. You do this using the `result_backend` configuration setting, just like you set the `broker_url` in [Chapter 2: Configuration](02_configuration.md).

Let's configure our app to use Redis (make sure you have Redis running!) as the result backend. We'll use database number `1` for results to keep it separate from the broker which might be using database `0`.

```python
# celery_app.py
from celery import Celery

# Configure BOTH broker and result backend
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1') # <-- Result Backend URL

# You could also use app.config_from_object('celeryconfig')
# if result_backend = 'redis://localhost:6379/1' is in celeryconfig.py

# ... your task definitions (@app.task) would go here or be imported ...
@app.task
def add(x, y):
    import time
    time.sleep(3) # Simulate work
    return x + y

@app.task
def fail_sometimes(x):
    import random
    if random.random() < 0.5:
        raise ValueError("Something went wrong!")
    return f"Processed {x}"
```

**Explanation:**

*   `backend='redis://localhost:6379/1'`: We provide a URL telling Celery to use the Redis server running on `localhost`, port `6379`, and specifically database `1` for storing results. (The `backend` argument is an alias for `result_backend`).

**2. Send a Task and Get the `AsyncResult`**

When you send a task, the returned object is your key to the result.

```python
# run_tasks.py
from celery_app import add, fail_sometimes

# Send the add task
result_add = add.delay(10, 20)
print(f"Sent task add(10, 20). Task ID: {result_add.id}")

# Send the task that might fail
result_fail = fail_sometimes.delay("my data")
print(f"Sent task fail_sometimes('my data'). Task ID: {result_fail.id}")
```

**Explanation:**

*   `result_add` and `result_fail` are `AsyncResult` objects. They contain the `.id` attribute, which is the unique identifier for *this specific execution* of the task.

**3. Check the Status and Get the Result**

Now, you can use the `AsyncResult` object to interact with the result backend.

**(Run a worker in another terminal first: `celery -A celery_app worker --loglevel=info`)**

```python
# continue in run_tasks.py or a new Python session
from celery_app import app # Need app for AsyncResult if creating from ID

# Use the AsyncResult objects we got earlier
# Or, if you only have the ID, you can recreate the AsyncResult:
# result_add = app.AsyncResult('the-task-id-you-saved-earlier')

print(f"\nChecking results for add task ({result_add.id})...")

# Check if the task is finished (returns True/False immediately)
print(f"Is add ready? {result_add.ready()}")

# Check the state (returns 'PENDING', 'STARTED', 'SUCCESS', 'FAILURE', etc.)
print(f"State of add: {result_add.state}")

# Get the result. IMPORTANT: This call will BLOCK until the task is finished!
# If the task failed, this will raise the exception that occurred in the worker.
try:
    # Set a timeout (in seconds) to avoid waiting forever
    final_result = result_add.get(timeout=10)
    print(f"Result of add: {final_result}")
    print(f"Did add succeed? {result_add.successful()}")
    print(f"Final state of add: {result_add.state}")
except Exception as e:
    print(f"Could not get result for add: {type(e).__name__} - {e}")
    print(f"Final state of add: {result_add.state}")
    print(f"Did add fail? {result_add.failed()}")
    # Get the traceback if it failed
    print(f"Traceback: {result_add.traceback}")


print(f"\nChecking results for fail_sometimes task ({result_fail.id})...")
try:
    # Wait up to 10 seconds for this task
    fail_result = result_fail.get(timeout=10)
    print(f"Result of fail_sometimes: {fail_result}")
    print(f"Did fail_sometimes succeed? {result_fail.successful()}")
    print(f"Final state of fail_sometimes: {result_fail.state}")
except Exception as e:
    print(f"Could not get result for fail_sometimes: {type(e).__name__} - {e}")
    print(f"Final state of fail_sometimes: {result_fail.state}")
    print(f"Did fail_sometimes fail? {result_fail.failed()}")
    print(f"Traceback:\n{result_fail.traceback}")
```

**Explanation & Potential Output:**

*   `result.ready()`: Checks if the task has finished (reached a `SUCCESS`, `FAILURE`, or other final state). Non-blocking.
*   `result.state`: Gets the current state string. Non-blocking.
*   `result.successful()`: Returns `True` if the state is `SUCCESS`. Non-blocking.
*   `result.failed()`: Returns `True` if the state is `FAILURE`. Non-blocking.
*   `result.get(timeout=...)`: This is the most common way to get the actual return value.
    *   **It blocks** (waits) until the task completes *or* the timeout expires.
    *   If the task state becomes `SUCCESS`, it returns the value the task function returned (e.g., `30`).
    *   If the task state becomes `FAILURE`, it **raises** the exception that occurred in the worker (e.g., `ValueError: Something went wrong!`).
    *   If the timeout is reached before the task finishes, it raises a `celery.exceptions.TimeoutError`.
*   `result.traceback`: If the task failed, this contains the error traceback string from the worker.

**(Example Output - might vary for `fail_sometimes` due to randomness)**

```text
Sent task add(10, 20). Task ID: f5e8a3f6-c7b1-4a9e-8f0a-1b2c3d4e5f6a
Sent task fail_sometimes('my data'). Task ID: 9b1d8c7e-a6f5-4b3a-9c8d-7e6f5a4b3c2d

Checking results for add task (f5e8a3f6-c7b1-4a9e-8f0a-1b2c3d4e5f6a)...
Is add ready? False
State of add: PENDING  # Or STARTED if checked quickly after worker picks it up
Result of add: 30
Did add succeed? True
Final state of add: SUCCESS

Checking results for fail_sometimes task (9b1d8c7e-a6f5-4b3a-9c8d-7e6f5a4b3c2d)...
Could not get result for fail_sometimes: ValueError - Something went wrong!
Final state of fail_sometimes: FAILURE
Did fail_sometimes fail? True
Traceback:
Traceback (most recent call last):
  File "/path/to/celery/app/trace.py", line ..., in trace_task
    R = retval = fun(*args, **kwargs)
  File "/path/to/celery/app/trace.py", line ..., in __protected_call__
    return self.run(*args, **kwargs)
  File "/path/to/your/project/celery_app.py", line ..., in fail_sometimes
    raise ValueError("Something went wrong!")
ValueError: Something went wrong!
```

## How It Works Internally

1.  **Task Sent:** Your application calls `add.delay(10, 20)`. It sends a message to the **Broker** and gets back an `AsyncResult` object containing the unique `task_id`.
2.  **Worker Executes:** A **Worker** picks up the task message from the Broker. It finds the `add` function and executes `add(10, 20)`. The function returns `30`.
3.  **Worker Stores Result:** Because a `result_backend` is configured (`redis://.../1`), the Worker:
    *   Connects to the Result Backend (Redis DB 1).
    *   Prepares the result data (e.g., `{'status': 'SUCCESS', 'result': 30, 'task_id': 'f5e8...', ...}`).
    *   Stores this data in the backend, using the `task_id` as the key (e.g., in Redis, it might set a key like `celery-task-meta-f5e8a3f6-c7b1-4a9e-8f0a-1b2c3d4e5f6a` to the JSON representation of the result data).
    *   It might also set an expiry time on the result if configured (`result_expires`).
4.  **Client Checks Result:** Your application calls `result_add.get(timeout=10)` on the `AsyncResult` object.
5.  **Client Queries Backend:** The `AsyncResult` object uses the `task_id` (`f5e8...`) and the configured `result_backend` URL:
    *   It connects to the Result Backend (Redis DB 1).
    *   It repeatedly fetches the data associated with the `task_id` key (e.g., `GET celery-task-meta-f5e8...` in Redis).
    *   It checks the `status` field in the retrieved data.
    *   If the status is `PENDING` or `STARTED`, it waits a short interval and tries again, until the timeout is reached.
    *   If the status is `SUCCESS`, it extracts the `result` field (`30`) and returns it.
    * If the status is `FAILURE`, it extracts the `result` field (which contains exception info), reconstructs the exception, and raises it.

```mermaid
sequenceDiagram
    participant Client as Your Application
    participant Task as add.delay(10, 20)
    participant Broker as Message Broker (Redis DB 0)
    participant Worker as Celery Worker
    participant ResultBackend as Result Backend (Redis DB 1)
    participant AsyncResult as result_add = AsyncResult(...)

    Client->>Task: Call add.delay(10, 20)
    Task->>Broker: Send task message (task_id: 't1')
    Task-->>Client: Return AsyncResult (id='t1')

    Worker->>Broker: Fetch message (task_id: 't1')
    Worker->>Worker: Execute add(10, 20) -> returns 30
    Worker->>ResultBackend: Store result (key='t1', value={'status': 'SUCCESS', 'result': 30, ...})
    ResultBackend-->>Worker: Ack (Result stored)
    Worker->>Broker: Ack message complete

    Client->>AsyncResult: Call result_add.get(timeout=10)
    loop Check Backend Until Ready or Timeout
        AsyncResult->>ResultBackend: Get result for key='t1'
        ResultBackend-->>AsyncResult: Return {'status': 'SUCCESS', 'result': 30, ...}
    end
    AsyncResult-->>Client: Return 30
```

## Code Dive: Storing and Retrieving Results

* **Backend Loading (`celery/app/backends.py`):** When Celery starts, it uses the `result_backend` URL to look up the correct backend class (e.g., `RedisBackend`, `DatabaseBackend`, `RPCBackend`) using functions like `by_url` and `by_name`. These map URL schemes (`redis://`, `db+postgresql://`, `rpc://`) or aliases ('redis', 'db', 'rpc') to the actual Python classes. The mapping is defined in `BACKEND_ALIASES`.
* **Base Classes (`celery/backends/base.py`):** All result backends inherit from `BaseBackend`. Many common backends (like Redis, Memcached) inherit from `BaseKeyValueStoreBackend`, which provides common logic for storing results using keys.
* **Storing Result (`BaseKeyValueStoreBackend._store_result` in `celery/backends/base.py`):** This method (called by the worker) is responsible for actually saving the result.
    ```python
    # Simplified from backends/base.py (inside BaseKeyValueStoreBackend)
    def _store_result(self, task_id, result, state,
                      traceback=None, request=None, **kwargs):
        # 1. Prepare the metadata dictionary
        meta = self._get_result_meta(result=result, state=state,
                                     traceback=traceback, request=request)
        meta['task_id'] = bytes_to_str(task_id)  # Ensure task_id is str

        # (Check if already successfully stored to prevent overwrites - omitted for brevity)

        # 2. Encode the metadata (e.g., to JSON or pickle)
        encoded_meta = self.encode(meta)

        # 3. Get the specific key for this task
        key = self.get_key_for_task(task_id)  # e.g., b'celery-task-meta-<task_id>'

        # 4. Call the specific backend's 'set' method (implemented by RedisBackend etc.)
        #    It might also set an expiry time (self.expires)
        try:
            self._set_with_state(key, encoded_meta, state)  # Calls self.set(key, encoded_meta)
        except Exception as exc:
            # Handle potential storage errors, maybe retry
            raise BackendStoreError(...) from exc

        return result  # Returns the original (unencoded) result
    ```

    The `self.set()` method is implemented by the concrete backend (e.g., `RedisBackend.set` uses `redis-py` client's `setex` or `set` command).

* **Retrieving Result (`BaseBackend.wait_for` or `BaseKeyValueStoreBackend.get_many` in `celery/backends/base.py`):** When you call `AsyncResult.get()`, it often ends up calling `wait_for` or similar methods that poll the backend.

    ```python
    # Simplified from backends/base.py (inside SyncBackendMixin)
    def wait_for(self, task_id,
                 timeout=None, interval=0.5, no_ack=True, on_interval=None):
        """Wait for task and return its result meta."""
        self._ensure_not_eager()  # Check if running in eager mode

        time_elapsed = 0.0

        while True:
            # 1. Get metadata from backend (calls self._get_task_meta_for)
            meta = self.get_task_meta(task_id)

            # 2. Check if the task is in a final state
            if meta['status'] in states.READY_STATES:
                return meta  # Return the full metadata dict

            # 3. Call interval callback if provided
            if on_interval:
                on_interval()

            # 4. Sleep to avoid busy-waiting
            time.sleep(interval)
            time_elapsed += interval

            # 5. Check for timeout
            if timeout and time_elapsed >= timeout:
                raise TimeoutError('The operation timed out.')
    ```

    The `self.get_task_meta(task_id)` eventually calls `self._get_task_meta_for(task_id)`, which in `BaseKeyValueStoreBackend` uses `self.get(key)` (e.g., `RedisBackend.get` uses `redis-py` client's `GET` command) and then decodes the result using `self.decode_result`.

## Conclusion

You've learned about the crucial **Result Backend**:

* It acts as a **storage place** (like a filing cabinet or database) for task results and states.
* It's configured using the `result_backend` setting in your [Celery App](01_celery_app.md).
* The [Worker](05_worker.md) stores the outcome (success value or failure exception) in the backend after executing a [Task](03_task.md).
* You use the `AsyncResult` object (returned by `.delay()` or `.apply_async()`) and its methods (`.get()`, `.state`, `.ready()`) to query the backend using the task's unique ID.
* Various backend types exist (Redis, Database, RPC, etc.), each with different characteristics.

Result backends allow your application to track the progress and outcome of background work. But what if you want tasks to run automatically at specific times or on a regular schedule, like sending a report every morning? That's where Celery's scheduler comes in.

**Next:** [Chapter 7: Beat (Scheduler)](07_beat__scheduler_.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Celery/07_beat__scheduler_.md
================================================
---
layout: default
title: "Beat (Scheduler)"
parent: "Celery"
nav_order: 7
---

# Chapter 7: Beat (Scheduler) - Celery's Alarm Clock

In the last chapter, [Chapter 6: Result Backend](06_result_backend.md), we learned how to track the status and retrieve the results of our background tasks.
This is great when we manually trigger tasks from our application. But what if we want tasks to run automatically, without us needing to press a button every time?

Maybe you need to:

* Send out a newsletter email every Friday morning.
* Clean up temporary files in your system every night.
* Check the health of your external services every 5 minutes.

How can you make Celery do these things on a regular schedule? Meet **Celery Beat**.

## What Problem Does Beat Solve?

Imagine you have a task, say `send_daily_report()`, that needs to run every morning at 8:00 AM. How would you achieve this? You could try setting up a system `cron` job to call a Python script that sends the Celery task, but that adds another layer of complexity.

Celery provides its own built-in solution: **Beat**.

**Beat is Celery's periodic task scheduler.** Think of it like a dedicated alarm clock or a `cron` job system built specifically for triggering Celery tasks. It's a separate program that you run alongside your workers. Its job is simple:

1. Read a list of scheduled tasks (e.g., "run `send_daily_report` every day at 8:00 AM").
2. Keep track of the time.
3. When the time comes for a scheduled task, Beat sends the task message to the [Broker Connection (AMQP)](04_broker_connection__amqp_.md), just as if you had called `.delay()` yourself.
4. A regular Celery [Worker](05_worker.md) then picks up the task from the broker and executes it.

Beat doesn't run the tasks itself; it just *schedules* them by sending the messages at the right time.

## Key Concepts

1. **Beat Process:** A separate Celery program you run (like `celery -A your_app beat`). It needs access to your Celery app's configuration.
2. **Schedule:** A configuration setting (usually `beat_schedule` in your Celery config) that defines which tasks should run and when. This schedule can use simple intervals (like every 30 seconds) or cron-like patterns (like "every Monday at 9 AM").
3. **Schedule Storage:** Beat needs to remember when each task was last run so it knows when it's due again. By default, it saves this information to a local file named `celerybeat-schedule` (using Python's `shelve` module).
4. **Ticker:** The heart of Beat. It's an internal loop that wakes up periodically, checks the schedule against the current time, and sends messages for any due tasks.

## How to Use Beat

Let's schedule two tasks:

* Our `add` task from [Chapter 3: Task](03_task.md) to run every 15 seconds.
* A new (dummy) task `send_report` to run every minute.

**1. Define the Schedule in Configuration**

The best place to define your schedule is in your configuration, either directly on the `app` object or in a separate `celeryconfig.py` file (see [Chapter 2: Configuration](02_configuration.md)). We'll use a separate file.

First, create the new task in your `tasks.py`:

```python
# tasks.py (add this new task)
from celery_app import app
import time

@app.task
def add(x, y):
    """A simple task that adds two numbers."""
    print(f"Task 'add' starting with ({x}, {y})")
    time.sleep(2)  # Simulate short work
    result = x + y
    print(f"Task 'add' finished with result: {result}")
    return result

@app.task
def send_report(name):
    """A task simulating sending a report."""
    print(f"Task 'send_report' starting for report: {name}")
    time.sleep(5)  # Simulate longer work
    print(f"Report '{name}' supposedly sent.")
    return f"Report {name} sent."
```

Now, update or create `celeryconfig.py`:

```python
# celeryconfig.py
from datetime import timedelta
from celery.schedules import crontab

# Basic Broker/Backend settings (replace with your actual URLs)
broker_url = 'redis://localhost:6379/0'
result_backend = 'redis://localhost:6379/1'
timezone = 'UTC'  # Or your preferred timezone, e.g., 'America/New_York'
enable_utc = True

# List of modules to import when the Celery worker starts.
# Make sure tasks.py is discoverable in your Python path
imports = ('tasks',)

# Define the Beat schedule
beat_schedule = {
    # Executes tasks.add every 15 seconds with arguments (16, 16)
    'add-every-15-seconds': {
        'task': 'tasks.add',          # The task name
        'schedule': 15.0,             # Run every 15 seconds (float or timedelta)
        'args': (16, 16),             # Positional arguments for the task
    },
    # Executes tasks.send_report every minute
    'send-report-every-minute': {
        'task': 'tasks.send_report',
        'schedule': crontab(),        # Use crontab() for "every minute"
        'args': ('daily-summary',),   # Argument for the report name
        # Example using crontab for more specific timing:
        # 'schedule': crontab(hour=8, minute=0, day_of_week='fri'),  # Every Friday at 8:00 AM
    },
}
```

**Explanation:**

* `from datetime import timedelta`: Used for simple interval schedules.
* `from celery.schedules import crontab`: Used for cron-like scheduling.
* `imports = ('tasks',)`: Ensures the worker and beat know about the tasks defined in `tasks.py`.
* `beat_schedule = {...}`: This dictionary holds all your scheduled tasks.
    * Each key (`'add-every-15-seconds'`, `'send-report-every-minute'`) is a unique name for the schedule entry.
    * Each value is another dictionary describing the schedule:
        * `'task'`: The full name of the task to run (e.g., `'module_name.task_name'`).
        * `'schedule'`: Defines *when* to run.
            * A `float` or `int`: number of seconds between runs.
            * A `timedelta` object: the time interval between runs.
            * A `crontab` object: for complex schedules (minute, hour, day_of_week, etc.). `crontab()` with no arguments means "every minute".
        * `'args'`: A tuple of positional arguments to pass to the task.
        * `'kwargs'`: (Optional) A dictionary of keyword arguments to pass to the task.
        * `'options'`: (Optional) A dictionary of execution options like `queue`, `priority`.

**2. Load the Configuration in Your App**

Make sure your `celery_app.py` loads this configuration:

```python
# celery_app.py
from celery import Celery

# Create the app instance
app = Celery('tasks')

# Load configuration from the 'celeryconfig' module
app.config_from_object('celeryconfig')

# Tasks might be defined here, but we put them in tasks.py
# which is loaded via the 'imports' setting in celeryconfig.py
```

**3. Run Celery Beat**

Now, open a terminal and run the Beat process. You need to tell it where your app is (`-A celery_app`):

```bash
# In your terminal
celery -A celery_app beat --loglevel=info
```

**Explanation:**

* `celery`: The Celery command-line tool.
* `-A celery_app`: Points to your app instance (in `celery_app.py`).
* `beat`: Tells Celery to start the scheduler process.
* `--loglevel=info`: Shows informational messages about what Beat is doing.

You'll see output similar to this:

```text
celery beat v5.x.x is starting.
__ - ... __ - _
LocalTime -> 2023-10-27 11:00:00
Configuration ->
    . broker -> redis://localhost:6379/0
    . loader -> celery.loaders.app.AppLoader
    . scheduler -> celery.beat.PersistentScheduler
    . db -> celerybeat-schedule
    . logfile -> [stderr]@INFO
    . maxinterval -> 300.0s (5m0s)
celery beat v5.x.x has started.
```

Beat is now running! It will check the schedule and:

* Every 15 seconds, it will send a message to run `tasks.add(16, 16)`.
* Every minute, it will send a message to run `tasks.send_report('daily-summary')`.

**4. Run a Worker (Crucial!)**

Beat only *sends* the task messages. You still need a [Worker](05_worker.md) running to actually *execute* the tasks. Open **another terminal** and start a worker:

```bash
# In a SECOND terminal
celery -A celery_app worker --loglevel=info
```

Now, watch the output in the **worker's terminal**.
You should see logs appearing periodically as the worker receives and executes the tasks sent by Beat:

```text
# Output in the WORKER terminal (example)
[2023-10-27 11:00:15,000: INFO/MainProcess] Task tasks.add[task-id-1] received
Task 'add' starting with (16, 16)
Task 'add' finished with result: 32
[2023-10-27 11:00:17,050: INFO/MainProcess] Task tasks.add[task-id-1] succeeded in 2.05s: 32
[2023-10-27 11:01:00,000: INFO/MainProcess] Task tasks.send_report[task-id-2] received
Task 'send_report' starting for report: daily-summary
[2023-10-27 11:01:00,000: INFO/MainProcess] Task tasks.add[task-id-3] received  # Another 'add' task might arrive while 'send_report' runs
Task 'add' starting with (16, 16)
Task 'add' finished with result: 32
[2023-10-27 11:01:02,050: INFO/MainProcess] Task tasks.add[task-id-3] succeeded in 2.05s: 32
Report 'daily-summary' supposedly sent.
[2023-10-27 11:01:05,100: INFO/MainProcess] Task tasks.send_report[task-id-2] succeeded in 5.10s: "Report daily-summary sent."
... and so on ...
```

You have successfully set up scheduled tasks!

## How It Works Internally (Simplified)

1. **Startup:** You run `celery -A celery_app beat`. The Beat process starts.
2. **Load Config:** It loads the Celery app (`celery_app`) and reads its configuration, paying special attention to `beat_schedule`.
3. **Load State:** It opens the schedule file (e.g., `celerybeat-schedule`) to see when each task was last run. If the file doesn't exist, it creates it.
4. **Main Loop (Tick):** Beat enters its main loop (the "ticker").
5. **Calculate Due Tasks:** In each tick, Beat looks at every entry in `beat_schedule`. For each entry, it compares the current time with the task's `schedule` definition and its `last_run_at` time (from the schedule file). It calculates which tasks are due to run *right now*.
6. **Send Task Message:** If a task (e.g., `add-every-15-seconds`) is due, Beat constructs a task message (containing `'tasks.add'`, `args=(16, 16)`, etc.) just like `.delay()` would. It sends this message to the configured **Broker**.
7. **Update State:** Beat updates the `last_run_at` time for the task it just sent in its internal state and saves this back to the schedule file.
8. **Sleep:** Beat calculates the time until the *next* scheduled task is due and sleeps for that duration (or up to a maximum interval, `beat_max_loop_interval`, usually 5 minutes, whichever is shorter).
9. **Repeat:** Go back to step 5.

Meanwhile, a **Worker** process is connected to the same **Broker**, picks up the task messages sent by Beat, and executes them.

```mermaid
sequenceDiagram
    participant Beat as Celery Beat Process
    participant ScheduleCfg as beat_schedule Config
    participant ScheduleDB as celerybeat-schedule File
    participant Broker as Message Broker
    participant Worker as Celery Worker

    Beat->>ScheduleCfg: Load schedule definitions on startup
    Beat->>ScheduleDB: Load last run times on startup

    loop Tick Loop (e.g., every second or more)
        Beat->>Beat: Check current time
        Beat->>ScheduleCfg: Get definition for 'add-every-15'
        Beat->>ScheduleDB: Get last run time for 'add-every-15'
        Beat->>Beat: Calculate if 'add-every-15' is due now
        alt Task 'add-every-15' is due
            Beat->>Broker: Send task message('tasks.add', (16, 16))
            Broker-->>Beat: Ack (Message Queued)
            Beat->>ScheduleDB: Update last run time for 'add-every-15'
            ScheduleDB-->>Beat: Ack (Saved)
        end
        Beat->>Beat: Calculate time until next task is due
        Beat->>Beat: Sleep until next check
    end

    Worker->>Broker: Fetch task message ('tasks.add', ...)
    Broker-->>Worker: Deliver message
    Worker->>Worker: Execute task add(16, 16)
    Worker->>Broker: Ack message complete
```

## Code Dive: Where Beat Lives

* **Command Line (`celery/bin/beat.py`):** Handles the `celery beat` command, parses arguments (`-A`, `-s`, `-S`, `--loglevel`), and creates/runs the `Beat` service object.
* **Beat Service Runner (`celery/apps/beat.py`):** The `Beat` class sets up the environment, loads the app, initializes logging, creates the actual scheduler service (`celery.beat.Service`), installs signal handlers, and starts the service.
* **Beat Service (`celery/beat.py:Service`):** This class manages the lifecycle of the scheduler. Its `start()` method contains the main loop that repeatedly calls `scheduler.tick()`. It loads the scheduler class specified in the configuration (defaulting to `PersistentScheduler`).
* **Scheduler (`celery/beat.py:Scheduler` / `PersistentScheduler`):** This is the core logic.
    * `Scheduler` is the base class. Its `tick()` method calculates the time until the next event, finds due tasks, calls `apply_entry` for due tasks, and returns the sleep interval.
    * `PersistentScheduler` inherits from `Scheduler` and adds the logic to load/save the schedule state (last run times) using `shelve` (the `celerybeat-schedule` file). It overrides methods like `setup_schedule`, `sync`, `close`, and `schedule` property to interact with the `shelve` store (`self._store`).
* **Schedule Types (`celery/schedules.py`):** Defines classes like `schedule` (for `timedelta` intervals) and `crontab`. These classes implement the `is_due(last_run_at)` method, which the `Scheduler.tick()` method uses to determine if a task entry should run.

A simplified conceptual look at the `beat_schedule` config structure:

```python
# Example structure from celeryconfig.py
beat_schedule = {
    'schedule-name-1': {                    # Unique name for this entry
        'task': 'my_app.tasks.task1',       # Task to run (module.task_name)
        'schedule': 30.0,                   # When to run (e.g., seconds, timedelta, crontab)
        'args': (arg1, arg2),               # Optional: Positional arguments
        'kwargs': {'key': 'value'},         # Optional: Keyword arguments
        'options': {'queue': 'hipri'},      # Optional: Execution options
    },
    'schedule-name-2': {
        'task': 'my_app.tasks.task2',
        'schedule': crontab(minute=0, hour=0),  # e.g., Run at midnight
        # ... other options ...
    },
}
```

And a very simplified concept of the `Scheduler.tick()` method:

```python
# Simplified conceptual logic of Scheduler.tick()
def tick(self):
    remaining_times = []
    due_tasks = []

    # 1. Iterate through schedule entries
    for entry in self.schedule.values():  # self.schedule reads from PersistentScheduler._store['entries']
        # 2. Check if entry is due using its schedule object (e.g., crontab)
        is_due, next_time_to_run = entry.is_due()  # Calls schedule.is_due(entry.last_run_at)

        if is_due:
            due_tasks.append(entry)
        else:
            remaining_times.append(next_time_to_run)  # Store time until next check

    # 3. Apply due tasks (send message to broker)
    for entry in due_tasks:
        self.apply_entry(entry)  # Sends task message and updates entry's last_run_at in schedule store

    # 4. Calculate minimum sleep time until next event
    return min(remaining_times + [self.max_interval])
```

## Conclusion

Celery Beat is your tool for automating task execution within the Celery ecosystem.

* It acts as a **scheduler**, like an alarm clock or `cron` for Celery tasks.
* It runs as a **separate process** (`celery beat`).
* You define the schedule using the `beat_schedule` setting in your configuration, specifying **what** tasks run, **when** (using intervals or crontabs), and with what **arguments**.
* Beat **sends task messages** to the broker at the scheduled times.
* Running **Workers** are still required to pick up and execute these tasks.

Beat allows you to reliably automate recurring background jobs, from simple periodic checks to complex, time-specific operations.

Now that we know how to run individual tasks, get their results, and schedule them automatically, what if we want to create more complex workflows involving multiple tasks that depend on each other? That's where Celery's Canvas comes in.
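As a recap of the tick loop described in the Code Dive, here is a minimal standalone sketch in plain Python. It is not Celery's implementation: `IntervalEntry`, the `tick` function, and the `sent` list are hypothetical names invented for illustration, the clock is simulated rather than real, and "sending" just appends to a list instead of publishing to a broker.

```python
from dataclasses import dataclass


@dataclass
class IntervalEntry:
    """Hypothetical stand-in for a schedule entry that runs at a fixed interval."""
    name: str
    every: float        # seconds between runs
    last_run_at: float  # simulated timestamp of the previous run

    def is_due(self, now):
        """Return (is_due, seconds until this entry should be checked again)."""
        remaining = self.last_run_at + self.every - now
        if remaining <= 0:
            return True, self.every
        return False, remaining


def tick(entries, now, sent, max_interval=300.0):
    """One scheduler pass: 'send' every due entry, then return the sleep time."""
    waits = [max_interval]
    for entry in entries:
        due, next_check = entry.is_due(now)
        if due:
            sent.append((now, entry.name))  # stand-in for publishing to the broker
            entry.last_run_at = now         # remember when it last ran
        waits.append(next_check)
    return min(waits)


entries = [
    IntervalEntry("add-every-15-seconds", every=15.0, last_run_at=0.0),
    IntervalEntry("send-report-every-minute", every=60.0, last_run_at=0.0),
]
sent = []
now = 0.0
while now < 60.0:
    now += tick(entries, now, sent)  # simulated sleep: jump the clock forward
tick(entries, now, sent)             # one last pass at t=60

for when, name in sent:
    print(f"t={when:>4}: sent {name}")
```

The point of the sketch is the return value of `tick`: the loop never sleeps longer than the time until the nearest due entry, and never longer than the maximum interval.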

**Next:** [Chapter 8: Canvas (Signatures & Primitives)](08_canvas__signatures___primitives_.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Celery/08_canvas__signatures___primitives_.md
================================================
---
layout: default
title: "Canvas (Signatures & Primitives)"
parent: "Celery"
nav_order: 8
---

# Chapter 8: Canvas (Signatures & Primitives) - Building Task Workflows

In the previous chapter, [Chapter 7: Beat (Scheduler)](07_beat__scheduler_.md), we learned how to schedule tasks to run automatically at specific times using Celery Beat. This is great for recurring jobs. But what if you need to run a sequence of tasks, where one task depends on the result of another? Or run multiple tasks in parallel and then collect their results?

Imagine you're building a feature where a user uploads an article, and you need to:

1. Fetch the article content from a URL.
2. Process the text to extract keywords.
3. Process the text to detect the language.
4. Once *both* processing steps are done, save the article and the extracted metadata to your database.

Simply running these tasks independently won't work. Keyword extraction and language detection can happen at the same time, but only *after* the content is fetched. Saving can only happen *after* both processing steps are complete. How do you orchestrate this multi-step workflow?

This is where **Celery Canvas** comes in. It provides the building blocks to design complex task workflows.

## What Problem Does Canvas Solve?

Canvas helps you connect individual [Task](03_task.md)s together to form more sophisticated processes. It solves the problem of defining dependencies and flow control between tasks. Instead of just firing off tasks one by one and hoping they complete in the right order or manually checking results, Canvas lets you declare the desired workflow structure directly.

Think of it like having different types of Lego bricks:

* Some bricks represent a single task.
* Other bricks let you connect tasks end-to-end (run in sequence).
* Some let you stack bricks side-by-side (run in parallel).
* Others let you build a structure where several parallel steps must finish before the next piece is added.

Canvas gives you these connecting bricks for your Celery tasks.

## Key Concepts: Signatures and Primitives

The core ideas in Canvas are **Signatures** and **Workflow Primitives**.

1. **Signature (`signature` or `.s()`): The Basic Building Block**
    * A `Signature` wraps up everything needed to call a single task: the task's name, the arguments (`args`), the keyword arguments (`kwargs`), and any execution options (like `countdown`, `eta`, queue name).
    * Think of it as a **pre-filled request form** or a **recipe card** for a specific task execution. It doesn't *run* the task immediately; it just holds the plan for running it.
    * The easiest way to create a signature is using the `.s()` shortcut on a task function.

    ```python
    # tasks.py
    from celery_app import app  # Assuming app is defined in celery_app.py

    @app.task
    def add(x, y):
        return x + y

    # Create a signature for add(2, 3)
    add_sig = add.s(2, 3)

    # add_sig now holds the 'plan' to run add(2, 3)
    print(f"Signature: {add_sig}")
    print(f"Task name: {add_sig.task}")
    print(f"Arguments: {add_sig.args}")

    # To actually run it, you call .delay() or .apply_async() ON the signature
    # result_promise = add_sig.delay()
    ```

    **Output:**

    ```text
    Signature: tasks.add(2, 3)
    Task name: tasks.add
    Arguments: (2, 3)
    ```

2. **Primitives: Connecting the Blocks**

    Canvas provides several functions (primitives) to combine signatures into workflows:

    * **`chain`:** Links tasks sequentially. The result of the first task is passed as the first argument to the second task, and so on.
        * Analogy: An assembly line where each station passes its output to the next.
        * Syntax: `(sig1 | sig2 | sig3)` or `chain(sig1, sig2, sig3)`
    * **`group`:** Runs a list of tasks in parallel. It returns a special result object that helps track the group.
        * Analogy: Hiring several workers to do similar jobs independently at the same time.
        * Syntax: `group(sig1, sig2, sig3)`
    * **`chord`:** Runs a group of tasks in parallel (the "header"), and *then*, once *all* tasks in the group have finished successfully, it runs a single callback task (the "body") with the results of the header tasks.
        * Analogy: A team of researchers works on different parts of a project in parallel. Once everyone is done, a lead researcher collects all the findings to write the final report.
        * Syntax: `chord(group(header_sigs), body_sig)`

    There are other primitives like `chunks`, `xmap`, and `starmap`, but `chain`, `group`, and `chord` are the most fundamental ones for building workflows.

## How to Use Canvas: Building the Article Processing Workflow

Let's build the workflow we described earlier: Fetch -> (Process Keywords & Detect Language in parallel) -> Save.

**1. Define the Tasks**

First, we need our basic tasks.
Let's create dummy versions in `tasks.py`:

```python
# tasks.py
from celery_app import app
import time
import random

@app.task
def fetch_data(url):
    print(f"Fetching data from {url}...")
    time.sleep(1)
    # Simulate fetching some data
    data = f"Content from {url} - {random.randint(1, 100)}"
    print(f"Fetched: {data}")
    return data

@app.task
def process_part_a(data):
    print(f"Processing Part A for: {data}")
    time.sleep(2)
    result_a = f"Keywords for '{data}'"
    print("Part A finished.")
    return result_a

@app.task
def process_part_b(data):
    print(f"Processing Part B for: {data}")
    time.sleep(3)  # Simulate slightly longer processing
    result_b = f"Language for '{data}'"
    print("Part B finished.")
    return result_b

@app.task
def combine_results(results):
    # 'results' will be a list containing the return values
    # of process_part_a and process_part_b
    print(f"Combining results: {results}")
    time.sleep(1)
    final_output = f"Combined: {results[0]} | {results[1]}"
    print(f"Final Output: {final_output}")
    return final_output
```

**2. Define the Workflow Using Canvas**

Now, in a separate script or Python shell, let's define the workflow using signatures and primitives.

```python
# run_workflow.py
from celery import chain, group, chord
from tasks import fetch_data, process_part_a, process_part_b, combine_results

# The URL we want to process
article_url = "http://example.com/article1"

# Create the workflow structure
# 1. Fetch data. The result (data) is passed to the next step.
# 2. The next step is a chord:
#    - Header: A group running process_part_a and process_part_b in parallel.
#              Both tasks receive the 'data' from fetch_data.
#    - Body: combine_results receives a list of results from the group.
workflow = chain(
    fetch_data.s(article_url),                          # Step 1: Fetch
    chord(                                              # Step 2: Chord
        group(process_part_a.s(), process_part_b.s()),  # Header: Parallel processing
        combine_results.s()                             # Body: Combine results
    )
)

print(f"Workflow definition:\n{workflow}")

# Start the workflow
print("\nSending workflow to Celery...")
result_promise = workflow.apply_async()

print(f"Workflow sent! Final result ID: {result_promise.id}")
print("Run a Celery worker to execute the tasks.")

# You can optionally wait for the final result:
# final_result = result_promise.get()
# print(f"\nWorkflow finished! Final result: {final_result}")
```

**Explanation:**

* We import `chain`, `group`, `chord` from `celery`.
* We import our task functions.
* `fetch_data.s(article_url)`: Creates a signature for the first step.
* `process_part_a.s()` and `process_part_b.s()`: Create signatures for the parallel tasks. Note that we *don't* provide the `data` argument here. `chain` automatically passes the result of `fetch_data` to the *next* task in the sequence. Since the next task is a `chord` containing a `group`, Celery cleverly passes the `data` to *each* task within that group.
* `combine_results.s()`: Creates the signature for the final step (the chord's body). It doesn't need arguments initially because the `chord` will automatically pass the list of results from the header group to it.
* `chain(...)`: Connects `fetch_data` to the `chord`.
* `chord(group(...), ...)`: Defines that the group must finish before `combine_results` is called.
* `group(...)`: Defines that `process_part_a` and `process_part_b` run in parallel.
* `workflow.apply_async()`: This sends the *first* task (`fetch_data`) to the broker. The rest of the workflow is encoded in the task's options (like `link` or `chord` information) so that Celery knows what to do next after each step completes.
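
To see which arguments each task receives in this workflow, the data flow can be imitated in plain Python with no Celery, broker, or worker involved. This is only a sketch under loud assumptions: `run_chain` and `run_chord` are hypothetical helpers written for this example (they are not Celery APIs), the task functions are simplified copies without the sleeps and random suffix, and everything runs sequentially in one process, so it shows the argument passing but not the parallelism.

```python
def fetch_data(url):
    return f"Content from {url}"

def process_part_a(data):
    return f"Keywords for '{data}'"

def process_part_b(data):
    return f"Language for '{data}'"

def combine_results(results):
    return f"Combined: {results[0]} | {results[1]}"


def run_chord(header, body, *args):
    """Hypothetical helper: call every header function with the same args,
    then hand the list of their results to the body (the chord idea)."""
    return body([task(*args) for task in header])


def run_chain(first_arg, *steps):
    """Hypothetical helper: feed each step's return value into the next step
    as its argument (the chain idea)."""
    value = first_arg
    for step in steps:
        value = step(value)
    return value


final = run_chain(
    "http://example.com/article1",
    fetch_data,                                    # Step 1: fetch
    lambda data: run_chord(                        # Step 2: chord
        [process_part_a, process_part_b],          # header: both get 'data'
        combine_results,                           # body: gets [result_a, result_b]
        data,
    ),
)
print(final)
```

Note how neither `process_part_a` nor `combine_results` is given its argument up front; each receives it from the step before, which mirrors why the signatures in `run_workflow.py` are created with empty `.s()` calls.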

If you run this script (and have a [Worker](05_worker.md) running), you'll see the tasks execute in the worker logs, respecting the defined dependencies and parallelism. `fetch_data` runs first, then `process_part_a` and `process_part_b` run concurrently, and finally `combine_results` runs after both A and B are done.

## How It Works Internally (Simplified Walkthrough)

Let's trace a simpler workflow: `my_chain = (add.s(2, 2) | add.s(4))`

1. **Workflow Definition:** When you create `my_chain`, Celery creates a `chain` object containing the signatures `add.s(2, 2)` and `add.s(4)`.
2. **Sending (`my_chain.apply_async()`):**
    * Celery looks at the first task in the chain: `add.s(2, 2)`.
    * It prepares to send this task message to the [Broker Connection (AMQP)](04_broker_connection__amqp_.md).
    * Crucially, it adds a special option to the message, often called `link` (or uses the `chain` field in newer protocols). This option contains the *signature* of the next task in the chain: `add.s(4)`.
    * The message for `add(2, 2)` (with the link to `add(4)`) is sent to the broker.
3. **Worker 1 Executes First Task:**
    * A [Worker](05_worker.md) picks up the message for `add(2, 2)`.
    * It runs the `add` function with arguments `(2, 2)`. The result is `4`.
    * The worker stores the result `4` in the [Result Backend](06_result_backend.md) (if configured).
    * The worker notices the `link` option in the original message, pointing to `add.s(4)`.
4. **Worker 1 Sends Second Task:**
    * The worker takes the result of the first task (`4`).
    * It uses the linked signature `add.s(4)`.
    * It *prepends* the result (`4`) to the arguments of the linked signature, making it effectively `add.s(4, 4)`. *(Note: The original `4` in `add.s(4)` came from the chain definition, the first `4` is the result)*.
    * It sends a *new* message to the broker for `add(4, 4)`.
5. **Worker 2 Executes Second Task:**
    * Another (or the same) worker picks up the message for `add(4, 4)`.
    * It runs `add(4, 4)`. The result is `8`.
    * It stores the result `8` in the backend.
    * There are no more links, so the chain is complete.

`group` works by sending all task messages in the group concurrently. `chord` is more complex; it involves the workers coordinating via the [Result Backend](06_result_backend.md) to count completed tasks in the header before the callback task is finally sent.

```mermaid
sequenceDiagram
    participant Client as Your Code
    participant Canvas as workflow = chain(...)
    participant Broker as Message Broker
    participant Worker as Celery Worker

    Client->>Canvas: workflow.apply_async()
    Note over Canvas: Prepare msg for add(2, 2) with link=add.s(4)
    Canvas->>Broker: Send Task 1 msg ('add', (2, 2), link=add.s(4), id=T1)
    Broker-->>Canvas: Ack
    Canvas-->>Client: Return AsyncResult(id=T2) # ID of the *last* task in chain

    Worker->>Broker: Fetch msg (T1)
    Broker-->>Worker: Deliver Task 1 msg
    Worker->>Worker: Execute add(2, 2) -> returns 4
    Note over Worker: Store result 4 for T1 in Backend
    Worker->>Worker: Check 'link' option -> add.s(4)
    Note over Worker: Prepare msg for add(4, 4) using result 4 + linked args
    Worker->>Broker: Send Task 2 msg ('add', (4, 4), id=T2)
    Broker-->>Worker: Ack
    Worker->>Broker: Ack Task 1 msg complete

    Worker->>Broker: Fetch msg (T2)
    Broker-->>Worker: Deliver Task 2 msg
    Worker->>Worker: Execute add(4, 4) -> returns 8
    Note over Worker: Store result 8 for T2 in Backend
    Worker->>Broker: Ack Task 2 msg complete
```

## Code Dive: Canvas Implementation

The logic for signatures and primitives resides primarily in `celery/canvas.py`.

* **`Signature` Class:**
    * Defined in `celery/canvas.py`. It's essentially a dictionary subclass holding `task`, `args`, `kwargs`, `options`, etc.
    * The `.s()` method on a `Task` instance (in `celery/app/task.py`) is a shortcut to create a `Signature`.
    * `apply_async`: Prepares arguments/options by calling `_merge` and then delegates to `self.type.apply_async` (the task's method) or `app.send_task`.
* `link`, `link_error`: Methods that modify the `options` dictionary to add callbacks. * `__or__`: The pipe operator (`|`) overload. It checks the type of the right-hand operand (`other`) and constructs a `_chain` object accordingly. ```python # Simplified from celery/canvas.py class Signature(dict): # ... methods like __init__, clone, set, apply_async ... def link(self, callback): # Appends callback signature to the 'link' list in options return self.append_to_list_option('link', callback) def link_error(self, errback): # Appends errback signature to the 'link_error' list in options return self.append_to_list_option('link_error', errback) def __or__(self, other): # Called when you use the pipe '|' operator if isinstance(other, Signature): # task | task -> chain return _chain(self, other, app=self._app) # ... other cases for group, chain ... return NotImplemented ``` * **`_chain` Class:** * Also in `celery/canvas.py`, inherits from `Signature`. Its `task` name is hardcoded to `'celery.chain'`. The actual task signatures are stored in `kwargs['tasks']`. * `apply_async` / `run`: Contains the logic to handle sending the first task with the rest of the chain embedded in the options (either via `link` for protocol 1 or the `chain` message property for protocol 2). * `prepare_steps`: This complex method recursively unwraps nested primitives (like a chain within a chain, or a group that needs to become a chord) and sets up the linking between steps. ```python # Simplified concept from celery/canvas.py (chain execution) class _chain(Signature): # ... __init__, __or__ ... def apply_async(self, args=None, kwargs=None, **options): # ... handle always_eager ... return self.run(args, kwargs, app=self.app, **options) def run(self, args=None, kwargs=None, app=None, **options): # ... setup ... tasks, results = self.prepare_steps(...) 
# Unroll and freeze tasks if results: # If there are tasks to run first_task = tasks.pop() # Get the first task (list is reversed) remaining_chain = tasks if tasks else None # Determine how to pass the chain info (link vs. message field) use_link = self._use_link # ... logic to decide ... if use_link: # Protocol 1: Link first task to the second task if remaining_chain: first_task.link(remaining_chain.pop()) # (Worker handles subsequent links) options_to_apply = options # Pass original options else: # Protocol 2: Embed the rest of the reversed chain in options options_to_apply = ChainMap({'chain': remaining_chain}, options) # Send the *first* task only result_from_apply = first_task.apply_async(**options_to_apply) # Return AsyncResult of the *last* task in the original chain return results[0] ``` * **`group` Class:** * In `celery/canvas.py`. Its `task` name is `'celery.group'`. * `apply_async`: Iterates through its `tasks`, freezes each one (assigning a common `group_id`), sends their messages, and collects the `AsyncResult` objects into a `GroupResult`. It uses a `barrier` (from the `vine` library) to track completion. * **`chord` Class:** * In `celery/canvas.py`. Its `task` name is `'celery.chord'`. * `apply_async` / `run`: Coordinates with the result backend (`backend.apply_chord`). It typically runs the header `group` first, configuring it to notify the backend upon completion. The backend then triggers the `body` task once the count is reached. ## Conclusion Celery Canvas transforms simple tasks into powerful workflow components. * A **Signature** (`task.s()`) captures the details for a single task call without running it. * Primitives like **`chain`** (`|`), **`group`**, and **`chord`** combine signatures to define complex execution flows: * `chain`: Sequence (output of one to input of next). * `group`: Parallel execution. * `chord`: Parallel execution followed by a callback with all results. 
* You compose these primitives like building with Lego bricks to model your application's logic. * Calling `.apply_async()` on a workflow primitive starts the process by sending the first task(s), embedding the rest of the workflow logic in the task options or using backend coordination. Canvas allows you to move complex orchestration logic out of your application code and into Celery, making your tasks more modular and your overall system more robust. Now that you can build and run complex workflows, how do you monitor what's happening inside Celery? How do you know when tasks start, finish, or fail in real-time? **Next:** [Chapter 9: Events](09_events.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Celery/09_events.md ================================================ --- layout: default title: "Events" parent: "Celery" nav_order: 9 --- # Chapter 9: Events - Listening to Celery's Heartbeat In [Chapter 8: Canvas (Signatures & Primitives)](08_canvas__signatures___primitives_.md), we saw how to build complex workflows by chaining tasks together or running them in parallel. But as your Celery system gets busier, you might wonder: "What are my workers doing *right now*? Which tasks have started? Which ones finished successfully or failed?" Imagine you're running an important data processing job involving many tasks. Wouldn't it be great to have a live dashboard showing the progress, or get immediate notifications if something goes wrong? This is where **Celery Events** come in. ## What Problem Do Events Solve? Celery Events provide a **real-time monitoring system** for your tasks and workers. Think of it like a live activity log or a notification system built into Celery. Without events, finding out what happened requires checking logs or querying the [Result Backend](06_result_backend.md) for each task individually. 
This isn't ideal for getting a live overview of the entire cluster. Events solve this by having workers broadcast messages (events) about important actions they take, such as: * A worker coming online or going offline. * A worker receiving a task. * A worker starting to execute a task. * A task succeeding or failing. * A worker sending out a heartbeat signal. Other programs can then listen to this stream of event messages to monitor the health and activity of the Celery cluster in real-time, build dashboards (like the popular tool Flower), or trigger custom alerts. ## Key Concepts 1. **Events:** Special messages sent by workers (and sometimes clients) describing an action. Each event has a `type` (e.g., `task-received`, `worker-online`) and contains details relevant to that action (like the task ID, worker hostname, timestamp). 2. **Event Exchange:** Events aren't sent to the regular task queues. They are published to a dedicated, named exchange on the [Broker Connection (AMQP)](04_broker_connection__amqp_.md). Think of it as a separate broadcast channel just for monitoring messages. 3. **Event Sender (`EventDispatcher`):** A component within the [Worker](05_worker.md) responsible for creating and sending event messages to the broker's event exchange. This is usually disabled by default for performance reasons. 4. **Event Listener (`EventReceiver`):** Any program that connects to the event exchange on the broker and consumes the stream of event messages. This could be the `celery events` command-line tool, Flower, or your own custom monitoring script. 5. **Event Types:** Celery defines many event types. Some common ones include: * `worker-online`, `worker-offline`, `worker-heartbeat`: Worker status updates. * `task-sent`: Client sent a task request (requires `task_send_sent_event` setting). * `task-received`: Worker received the task message. * `task-started`: Worker started executing the task code. * `task-succeeded`: Task finished successfully. 
    * `task-failed`: Task failed with an error.
    * `task-retried`: Task is being retried.
    * `task-revoked`: Task was cancelled/revoked.

## How to Use Events: Simple Monitoring

Let's see how to enable events and watch the live stream using Celery's built-in tool.

**1. Enable Events in the Worker**

By default, workers don't send task events, which saves resources. You need to explicitly tell them to start sending. You can do this in two main ways:

*   **Command-line flag (`-E`):** When starting your worker, add the `-E` flag.
    ```bash
    # Start a worker AND enable sending events
    celery -A celery_app worker --loglevel=info -E
    ```

*   **Configuration Setting:** Set `worker_send_task_events = True` in your Celery configuration ([Chapter 2: Configuration](02_configuration.md)). This is useful if you always want events enabled for workers using that configuration. Worker-level events (`worker-online`, `worker-heartbeat`) do not need a separate setting; workers normally send them on their own.
    ```python
    # celeryconfig.py (example)
    broker_url = 'redis://localhost:6379/0'
    result_backend = 'redis://localhost:6379/1'
    imports = ('tasks',)

    # Enable sending task-related events from workers
    worker_send_task_events = True

    # Optional: set to True if you also want task-sent events from clients
    task_send_sent_event = False
    ```

Now, any worker started with this configuration (or the `-E` flag) will publish task events to the broker.

**2. Watch the Event Stream**

Celery provides a command-line tool called `celery events` that acts as a simple event listener. Run with the `--dump` option, it prints the events it receives to your console; run without options, it opens a curses-based monitor instead.

Open **another terminal** (while your worker with events enabled is running) and run:

```bash
# Watch for events associated with your app and print them as they arrive
celery -A celery_app events --dump
```

If your workers are already running without `-E`, you can use the remote-control command `celery control enable_events` to tell them to start sending events, and `celery control disable_events` to stop them.
**What You'll See:**

Initially, `celery events` might show nothing. Now, try sending a task from another script or shell (like the `run_tasks.py` from [Chapter 3: Task](03_task.md)):

```python
# In a third terminal/shell
from tasks import add
result = add.delay(5, 10)
print(f"Sent task {result.id}")
```

Switch back to the terminal running `celery events`. You should see output similar to this (details and timestamps will vary):

```text
-> celery events v5.x.x
-> connected to redis://localhost:6379/0

-------------- task-received celery@myhostname [2022-10-27 12:00:01.100]
    uuid:a1b2c3d4-e5f6-7890-1234-567890abcdef
    name:tasks.add
    args:[5, 10]
    kwargs:{}
    retries:0
    eta:null
    hostname:celery@myhostname
    timestamp:1666872001.1
    pid:12345
    ...

-------------- task-started celery@myhostname [2022-10-27 12:00:01.150]
    uuid:a1b2c3d4-e5f6-7890-1234-567890abcdef
    hostname:celery@myhostname
    timestamp:1666872001.15
    pid:12345
    ...

-------------- task-succeeded celery@myhostname [2022-10-27 12:00:04.200]
    uuid:a1b2c3d4-e5f6-7890-1234-567890abcdef
    result:'15'
    runtime:3.05
    hostname:celery@myhostname
    timestamp:1666872004.2
    pid:12345
    ...
```

**Explanation:**

*   `celery events` connects to the broker defined in `celery_app`.
*   It listens for messages on the event exchange.
*   As the worker processes the `add(5, 10)` task, it sends `task-received`, `task-started`, and `task-succeeded` events, all carrying the same task `uuid`.
*   `celery events` receives these messages and prints their details.

This gives you a raw, real-time feed of what's happening in your Celery cluster!

**Flower: A Visual Monitor**

While `celery events` is useful, it's quite basic. A very popular tool called **Flower** uses the same event stream to provide a web-based dashboard for monitoring your Celery cluster. It shows running tasks, completed tasks, worker status, task details, and more, all updated in real-time thanks to Celery Events. You can typically install it (`pip install flower`) and run it (`celery -A celery_app flower`).
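To make the listener idea more concrete, here is a minimal sketch of the bookkeeping a monitor might do with the event stream. It is plain Python with no broker involved. The `TaskTracker` class and its method names are hypothetical and not part of Celery or Flower; the event dictionaries only mirror the fields shown in the sample output.

```python
# Minimal sketch of an event monitor's bookkeeping (no broker involved).
# TaskTracker is an illustrative name, not a Celery or Flower class.

class TaskTracker:
    """Keeps the latest known state of each task, keyed by its uuid."""

    # Maps an event type to the state a dashboard might display.
    STATES = {
        "task-received": "RECEIVED",
        "task-started": "STARTED",
        "task-succeeded": "SUCCESS",
        "task-failed": "FAILURE",
    }

    def __init__(self):
        self.tasks = {}

    def handle(self, event):
        state = self.STATES.get(event["type"])
        if state is None:
            return  # ignore events this sketch does not track (e.g. heartbeats)
        info = self.tasks.setdefault(event["uuid"], {})
        info["state"] = state
        # Copy selected optional fields when the event carries them.
        for key in ("name", "result", "runtime"):
            if key in event:
                info[key] = event[key]


# Hand-written events standing in for what a worker would broadcast.
events = [
    {"type": "task-received", "uuid": "a1b2", "name": "tasks.add"},
    {"type": "worker-heartbeat", "hostname": "celery@myhostname"},
    {"type": "task-started", "uuid": "a1b2"},
    {"type": "task-succeeded", "uuid": "a1b2", "result": "15", "runtime": 3.05},
]

tracker = TaskTracker()
for event in events:
    tracker.handle(event)

print(tracker.tasks["a1b2"])
```

Because every event for a task carries the same `uuid`, a monitor can fold the stream into one record per task, which is roughly the kind of view a dashboard presents.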
## How It Works Internally (Simplified)

1.  **Worker Action:** A worker performs an action (e.g., starts executing task `T1`).
2.  **Event Dispatch:** If events are enabled, the worker's internal `EventDispatcher` component is notified.
3.  **Create Event Message:** The `EventDispatcher` creates a dictionary representing the event (e.g., `{'type': 'task-started', 'uuid': 'T1', 'hostname': 'worker1', ...}`).
4.  **Publish to Broker:** The `EventDispatcher` uses its connection to the [Broker Connection (AMQP)](04_broker_connection__amqp_.md) to publish this event message to a specific **event exchange** (usually named `celeryev`). It uses a routing key based on the event type (e.g., `task.started`).
5.  **Listener Connects:** A monitoring tool (like `celery events` or Flower) starts up. It creates an `EventReceiver`.
6.  **Declare Queue:** The `EventReceiver` connects to the same broker and declares a temporary, unique queue bound to the event exchange (`celeryev`), often configured to receive all event types (`#` routing key).
7.  **Consume Events:** The `EventReceiver` starts consuming messages from its dedicated queue.
8.  **Process Event:** When an event message (like the `task-started` message for `T1`) arrives from the broker, the `EventReceiver` decodes it and passes it to a handler (e.g., `celery events` prints it, Flower updates its web UI).
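The routing in steps 4 and 6 can be sketched in plain Python. This is a simplified stand-in for what a topic exchange does, not Celery or broker code. The function names and the `bindings` dictionary are hypothetical, and the matcher assumes AMQP-style topic semantics (`*` matches exactly one word, `#` matches zero or more words).

```python
# Simplified simulation of event routing; no broker is contacted.
# routing_key_for, topic_matches and bindings are illustrative names.

def routing_key_for(event_type):
    """Derive the routing key from an event type, e.g. task-started -> task.started."""
    return event_type.replace("-", ".")


def topic_matches(pattern, routing_key):
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more."""
    def match(p, k):
        if not p:
            return not k  # pattern exhausted: match only if the key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb any number of leading words, including none.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


# Two hypothetical listener queues bound to the event exchange.
bindings = {"all-events": "#", "tasks-only": "task.#"}

for event_type in ("task-started", "worker-heartbeat"):
    key = routing_key_for(event_type)
    receivers = [name for name, pattern in bindings.items()
                 if topic_matches(pattern, key)]
    print(key, "->", receivers)
```

A listener bound with `#` sees every event, which is why general-purpose monitors use it, while a narrower pattern such as `task.#` would limit a listener to task events.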
```mermaid sequenceDiagram participant Worker participant Dispatcher as EventDispatcher (in Worker) participant Broker as Message Broker participant Receiver as EventReceiver (e.g., celery events tool) participant Display as Console/UI Worker->>Worker: Starts executing Task T1 Worker->>Dispatcher: Notify: Task T1 started Dispatcher->>Dispatcher: Create event message {'type': 'task-started', ...} Dispatcher->>Broker: Publish event msg to 'celeryev' exchange (routing_key='task.started') Broker-->>Dispatcher: Ack (Message Sent) Receiver->>Broker: Connect and declare unique queue bound to 'celeryev' exchange Broker-->>Receiver: Queue ready Broker->>Receiver: Deliver event message {'type': 'task-started', ...} Receiver->>Receiver: Decode message Receiver->>Display: Process event (e.g., print to console) ``` ## Code Dive: Sending and Receiving Events * **Enabling Events (`celery/worker/consumer/events.py`):** The `Events` bootstep in the worker process is responsible for initializing the `EventDispatcher`. The `-E` flag or configuration settings control whether this bootstep actually enables the dispatcher. ```python # Simplified from worker/consumer/events.py class Events(bootsteps.StartStopStep): requires = (Connection,) def __init__(self, c, task_events=True, # Controlled by config/flags # ... other flags ... **kwargs): self.send_events = task_events # or other flags self.enabled = self.send_events # ... super().__init__(c, **kwargs) def start(self, c): # ... gets connection ... # Creates the actual dispatcher instance dis = c.event_dispatcher = c.app.events.Dispatcher( c.connection_for_write(), hostname=c.hostname, enabled=self.send_events, # Only sends if enabled # ... other options ... ) # ... flush buffer ... ``` * **Sending Events (`celery/events/dispatcher.py`):** The `EventDispatcher` class has the `send` method, which creates the event dictionary and calls `publish`. ```python # Simplified from events/dispatcher.py class EventDispatcher: # ... 
__init__ setup ... def send(self, type, blind=False, ..., **fields): if self.enabled: groups, group = self.groups, group_from(type) if groups and group not in groups: return # Don't send if this group isn't enabled # ... potential buffering logic (omitted) ... # Call publish to actually send return self.publish(type, fields, self.producer, blind=blind, Event=Event, ...) def publish(self, type, fields, producer, blind=False, Event=Event, **kwargs): # Create the event dictionary clock = None if blind else self.clock.forward() event = Event(type, hostname=self.hostname, utcoffset=utcoffset(), pid=self.pid, clock=clock, **fields) # Publish using the underlying Kombu producer with self.mutex: return self._publish(event, producer, routing_key=type.replace('-', '.'), **kwargs) def _publish(self, event, producer, routing_key, **kwargs): exchange = self.exchange # The dedicated event exchange try: # Kombu's publish method sends the message producer.publish( event, # The dictionary payload routing_key=routing_key, exchange=exchange.name, declare=[exchange], # Ensure exchange exists serializer=self.serializer, # e.g., 'json' headers=self.headers, delivery_mode=self.delivery_mode, # e.g., transient **kwargs ) except Exception as exc: # ... error handling / buffering ... raise ``` * **Receiving Events (`celery/events/receiver.py`):** The `EventReceiver` class (used by tools like `celery events`) sets up a consumer to listen for messages on the event exchange. ```python # Simplified from events/receiver.py class EventReceiver(ConsumerMixin): # Uses Kombu's ConsumerMixin def __init__(self, channel, handlers=None, routing_key='#', ...): # ... setup app, channel, handlers ... self.exchange = get_exchange(..., name=self.app.conf.event_exchange) self.queue = Queue( # Create a unique, auto-deleting queue '.'.join([self.queue_prefix, self.node_id]), exchange=self.exchange, routing_key=routing_key, # Often '#' to get all events auto_delete=True, durable=False, # ... 
other queue options ... ) # ... def get_consumers(self, Consumer, channel): # Tell ConsumerMixin to consume from our event queue return [Consumer(queues=[self.queue], callbacks=[self._receive], # Method to call on message no_ack=True, # Events usually don't need explicit ack accept=self.accept)] # This method is registered as the callback for new messages def _receive(self, body, message): # Decode message body (can be single event or list in newer Celery) if isinstance(body, list): process, from_message = self.process, self.event_from_message [process(*from_message(event)) for event in body] else: self.process(*self.event_from_message(body)) # process() calls the appropriate handler from self.handlers def process(self, type, event): """Process event by dispatching to configured handler.""" handler = self.handlers.get(type) or self.handlers.get('*') handler and handler(event) # Call the handler function ``` ## Conclusion Celery Events provide a powerful mechanism for **real-time monitoring** of your distributed task system. * Workers (when enabled via `-E` or configuration) send **event messages** describing their actions (like task start/finish, worker online). * These messages go to a dedicated **event exchange** on the broker. * Tools like `celery events` or Flower act as **listeners** (`EventReceiver`), consuming this stream to provide insights into the cluster's activity. * Events are the foundation for building dashboards, custom monitoring, and diagnostic tools. Understanding events helps you observe and manage your Celery application more effectively. So far, we've explored the major components and concepts of Celery. But how does a worker actually start up? How does it initialize all these different parts like the connection, the consumer, the event dispatcher, and the execution pool in the right order? That's orchestrated by a system called Bootsteps. 
**Next:** [Chapter 10: Bootsteps](10_bootsteps.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Celery/10_bootsteps.md ================================================ --- layout: default title: "Bootsteps" parent: "Celery" nav_order: 10 --- # Chapter 10: Bootsteps - How Celery Workers Start Up In [Chapter 9: Events](09_events.md), we learned how to monitor the real-time activity within our Celery system. We've now covered most of the key parts of Celery: the [Celery App](01_celery_app.md), [Task](03_task.md)s, the [Broker Connection (AMQP)](04_broker_connection__amqp_.md), the [Worker](05_worker.md), the [Result Backend](06_result_backend.md), [Beat (Scheduler)](07_beat__scheduler_.md), [Canvas (Signatures & Primitives)](08_canvas__signatures___primitives_.md), and [Events](09_events.md). But have you ever wondered how the Celery worker manages to get all these different parts working together when you start it? When you run `celery worker`, it needs to connect to the broker, set up the execution pool, start listening for tasks, maybe start the event dispatcher, and possibly even start an embedded Beat scheduler. How does it ensure all these things happen in the correct order? That's where **Bootsteps** come in. ## What Problem Do Bootsteps Solve? Imagine you're assembling a complex piece of furniture. You have many parts and screws, and the instructions list a specific sequence of steps. You can't attach the tabletop before you've built the legs! Similarly, a Celery worker has many internal components that need to be initialized and started in a precise order. For example, the worker needs to: 1. Establish a connection to the [Broker Connection (AMQP)](04_broker_connection__amqp_.md). 2. *Then*, start the consumer logic that uses this connection to fetch tasks. 3. 
Set up the execution pool (like prefork or eventlet) that will actually run the tasks. 4. Start optional components like the [Events](09_events.md) dispatcher or the embedded [Beat (Scheduler)](07_beat__scheduler_.md). If these steps happen out of order (e.g., trying to fetch tasks before connecting to the broker), the worker will fail. **Bootsteps** provide a framework within Celery to define this startup (and shutdown) sequence. It's like the assembly instructions or a detailed checklist for the worker. Each major component or initialization phase is defined as a "step," and steps can declare dependencies on each other (e.g., "Step B requires Step A to be finished"). Celery uses this information to automatically figure out the correct order to start everything up and, just as importantly, the correct reverse order to shut everything down cleanly. This makes the worker's internal structure more organized, modular, and easier for Celery developers to extend with new features. As a user, you generally don't write bootsteps yourself, but understanding the concept helps demystify the worker's startup process. ## Key Concepts 1. **Step (`Step`):** A single, distinct part of the worker's startup or shutdown logic. Think of it as one instruction in the assembly manual. Examples include initializing the broker connection, starting the execution pool, or starting the component that listens for task messages (the consumer). 2. **Blueprint (`Blueprint`):** A collection of related steps that manage a larger component. For instance, the main "Consumer" component within the worker has its own blueprint defining steps for connection, event handling, task fetching, etc. 3. **Dependencies (`requires`):** A step can declare that it needs other steps to be completed first. For example, the step that starts fetching tasks (`Tasks`) *requires* the step that establishes the broker connection (`Connection`). 4. 
**Order:** Celery analyzes the `requires` declarations of all steps within a blueprint (and potentially across blueprints) to build a dependency graph. It then sorts this graph to determine the exact order in which steps must be started. Shutdown usually happens in the reverse order. ## How It Works: The Worker Startup Sequence You don't typically interact with bootsteps directly, but you see their effect every time you start a worker. When you run: `celery -A your_app worker --loglevel=info` Celery initiates the **Worker Controller** (`WorkController`). This controller uses the Bootstep framework, specifically a main **Blueprint**, to manage its initialization. Here's a simplified idea of what happens under the hood, orchestrated by Bootsteps: 1. **Load Blueprint:** The `WorkController` loads its main blueprint, which includes steps for core functionalities. 2. **Build Graph:** Celery looks at all the steps defined in the blueprint (e.g., `Connection`, `Pool`, `Consumer`, `Timer`, `Events`, potentially `Beat`) and their `requires` attributes. It builds a dependency graph. 3. **Determine Order:** It calculates the correct startup order from the graph (a "topological sort"). For example, it determines that `Connection` must start before `Consumer`, and `Pool` must start before `Consumer` can start dispatching tasks to it. 4. **Execute Steps:** The `WorkController` iterates through the steps in the determined order and calls each step's `start` method. * The `Connection` step establishes the link to the broker. * The `Timer` step sets up internal timers. * The `Pool` step initializes the execution pool (e.g., starts prefork child processes). * The `Events` step starts the event dispatcher (if `-E` was used). * The `Consumer` step (usually last) starts the main loop that fetches tasks from the broker and dispatches them to the pool. 5. 
**Worker Ready:** Once all essential bootsteps have successfully started, the worker prints the "ready" message and begins processing tasks. When you stop the worker (e.g., with Ctrl+C), a similar process happens in reverse using the steps' `stop` or `terminate` methods, ensuring connections are closed, pools are shut down, etc., in the correct order. ## Internal Implementation Walkthrough Let's visualize the simplified startup flow managed by bootsteps: ```mermaid sequenceDiagram participant CLI as `celery worker ...` participant WorkerMain as Worker Main Process participant Blueprint as Main Worker Blueprint participant DepGraph as Dependency Graph Builder participant Step1 as Connection Step participant Step2 as Pool Step participant Step3 as Consumer Step CLI->>WorkerMain: Start worker command WorkerMain->>Blueprint: Load blueprint definition (steps & requires) Blueprint->>DepGraph: Define steps and dependencies DepGraph->>Blueprint: Return sorted startup order [Step1, Step2, Step3] WorkerMain->>Blueprint: Iterate through sorted steps Blueprint->>Step1: Call start() Step1-->>Blueprint: Connection established Blueprint->>Step2: Call start() Step2-->>Blueprint: Pool initialized Blueprint->>Step3: Call start() Step3-->>Blueprint: Consumer loop started Blueprint-->>WorkerMain: Startup complete WorkerMain->>WorkerMain: Worker is Ready ``` The Bootstep framework relies on classes defined mainly in `celery/bootsteps.py`. ## Code Dive: Anatomy of a Bootstep Bootsteps are defined as classes inheriting from `Step` or `StartStopStep`. * **Defining a Step:** A step class defines its logic and dependencies. ```python # Simplified concept from celery/bootsteps.py # Base class for all steps class Step: # List of other Step classes needed before this one runs requires = () def __init__(self, parent, **kwargs): # Called when the blueprint is applied to the parent (e.g., Worker) # Can be used to set initial attributes on the parent. 
pass def create(self, parent): # Create the service/component managed by this step. # Often returns an object to be stored. pass def include(self, parent): # Logic to add this step to the parent's step list. # Called after __init__. if self.should_include(parent): self.obj = self.create(parent) # Store created object if needed parent.steps.append(self) return True return False # A common step type with start/stop/terminate methods class StartStopStep(Step): obj = None # Holds the object created by self.create def start(self, parent): # Logic to start the component/service if self.obj and hasattr(self.obj, 'start'): self.obj.start() def stop(self, parent): # Logic to stop the component/service gracefully if self.obj and hasattr(self.obj, 'stop'): self.obj.stop() def terminate(self, parent): # Logic to force shutdown (if different from stop) if self.obj: term_func = getattr(self.obj, 'terminate', None) or getattr(self.obj, 'stop', None) if term_func: term_func() # include() method adds self to parent.steps if created ``` **Explanation:** * `requires`: A tuple of other Step classes that must be fully started *before* this step's `start` method is called. This defines the dependencies. * `__init__`, `create`, `include`: Methods involved in setting up the step and potentially creating the component it manages. * `start`, `stop`, `terminate`: Methods called during the worker's lifecycle (startup, graceful shutdown, forced shutdown). * **Blueprint:** Manages a collection of steps. 
```python # Simplified concept from celery/bootsteps.py from celery.utils.graph import DependencyGraph class Blueprint: # Set of default step classes (or string names) included in this blueprint default_steps = set() def __init__(self, steps=None, name=None, **kwargs): self.name = name or self.__class__.__name__ # Combine default steps with any provided steps self.types = set(steps or []) | set(self.default_steps) self.steps = {} # Will hold step instances self.order = [] # Will hold sorted step instances # ... other callbacks ... def apply(self, parent, **kwargs): # 1. Load step classes from self.types step_classes = self.claim_steps() # {name: StepClass, ...} # 2. Build the dependency graph self.graph = DependencyGraph( ((Cls, Cls.requires) for Cls in step_classes.values()), # ... formatter options ... ) # 3. Get the topologically sorted order sorted_classes = self.graph.topsort() # 4. Instantiate and include each step self.order = [] for S in sorted_classes: step = S(parent, **kwargs) # Call Step.__init__ self.steps[step.name] = step self.order.append(step) for step in self.order: step.include(parent) # Call Step.include -> Step.create return self def start(self, parent): # Called by the parent (e.g., Worker) to start all steps for step in self.order: # Use the sorted order if hasattr(step, 'start'): step.start(parent) def stop(self, parent): # Called by the parent to stop all steps (in reverse order) for step in reversed(self.order): if hasattr(step, 'stop'): step.stop(parent) # ... other methods like close, terminate, restart ... ``` **Explanation:** * `default_steps`: Defines the standard components managed by this blueprint. * `apply`: The core method that takes the step definitions, builds the `DependencyGraph` based on `requires`, gets the sorted execution `order`, and then instantiates and includes each step. * `start`/`stop`: Iterate through the calculated `order` (or its reverse) to start/stop the components managed by each step. 
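The ordering idea behind `apply`, `start`, and `stop` can be tried out without Celery. The sketch below is only an illustration: it uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) in place of Celery's own `DependencyGraph`, the `Step` class is a toy stand-in, and the `requires` declarations are made up for the example rather than copied from the real worker steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Step:
    """Toy stand-in for a bootstep; not Celery's real Step class."""
    requires = ()  # step classes that must start before this one

    def __init__(self, log):
        self.log = log

    def start(self):
        self.log.append(f"start {type(self).__name__}")

    def stop(self):
        self.log.append(f"stop {type(self).__name__}")


# Illustrative dependencies only; the real steps declare their own.
class Connection(Step):
    pass


class Pool(Step):
    pass


class Events(Step):
    requires = (Connection,)


class Consumer(Step):
    requires = (Connection, Pool, Events)


def sort_steps(step_classes):
    """Return the classes ordered so every step follows its requirements."""
    graph = {cls: set(cls.requires) for cls in step_classes}
    return list(TopologicalSorter(graph).static_order())


log = []
order = sort_steps([Consumer, Events, Pool, Connection])
steps = [cls(log) for cls in order]

for step in steps:            # startup: dependencies first
    step.start()
for step in reversed(steps):  # shutdown: exact reverse of startup
    step.stop()

print("\n".join(log))
```

The classes are passed in deliberately scrambled order; the sort still places `Consumer` last because it requires every other step, and shutdown stops it first. The relative order of independent steps such as `Connection` and `Pool` is left to the sorter.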
* **Example Usage (Worker Components):** The worker's main components are defined as bootsteps in `celery/worker/components.py`. You can see classes like `Pool`, `Consumer`, `Timer`, `Beat`, each inheriting from `bootsteps.Step` or `bootsteps.StartStopStep` and potentially defining `requires`. The `Consumer` blueprint in `celery/worker/consumer/consumer.py` then lists many of these (`Connection`, `Events`, `Tasks`, etc.) in its `default_steps`. ## Conclusion You've learned about Bootsteps, the underlying framework that brings order to the Celery worker's startup and shutdown procedures. * They act as an **assembly guide** or **checklist** for the worker. * Each core function (connecting, starting pool, consuming tasks) is a **Step**. * Steps declare **Dependencies** (`requires`) on each other. * A **Blueprint** groups related steps. * Celery uses a **Dependency Graph** to determine the correct **order** to start and stop steps. * This ensures components like the [Broker Connection (AMQP)](04_broker_connection__amqp_.md), [Worker](05_worker.md) pool, and task consumer initialize and terminate predictably. While you typically don't write bootsteps as an end-user, understanding their role clarifies how the complex machinery of a Celery worker reliably comes to life and shuts down. --- This concludes our introductory tour of Celery's core concepts! We hope these chapters have given you a solid foundation for understanding how Celery works and how you can use it to build robust and scalable distributed applications. Happy tasking! --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Celery/index.md ================================================ --- layout: default title: "Celery" nav_order: 5 has_children: true --- # Tutorial: Celery > This tutorial is AI-generated! 
To learn more, check out [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) Celery[View Repo](https://github.com/celery/celery/tree/d1c35bbdf014f13f4ab698d75e3ea381a017b090/celery) is a system for running **distributed tasks** *asynchronously*. You define *units of work* (Tasks) in your Python code. When you want a task to run, you send a message using a **message broker** (like RabbitMQ or Redis). One or more **Worker** processes are running in the background, listening for these messages. When a worker receives a message, it executes the corresponding task. Optionally, the task's result (or any error) can be stored in a **Result Backend** (like Redis or a database) so you can check its status or retrieve the output later. Celery helps manage this whole process, making it easier to handle background jobs, scheduled tasks, and complex workflows. ```mermaid flowchart TD A0["Celery App"] A1["Task"] A2["Worker"] A3["Broker Connection (AMQP)"] A4["Result Backend"] A5["Canvas (Signatures & Primitives)"] A6["Beat (Scheduler)"] A7["Configuration"] A8["Events"] A9["Bootsteps"] A0 -- "Defines and sends" --> A1 A0 -- "Uses for messaging" --> A3 A0 -- "Uses for results" --> A4 A0 -- "Loads and uses" --> A7 A1 -- "Updates state in" --> A4 A2 -- "Executes" --> A1 A2 -- "Fetches tasks from" --> A3 A2 -- "Uses for lifecycle" --> A9 A5 -- "Represents task invocation" --> A1 A6 -- "Sends scheduled tasks via" --> A3 A8 -- "Sends events via" --> A3 A9 -- "Manages connection via" --> A3 ``` ================================================ FILE: docs/Click/01_command___group.md ================================================ --- layout: default title: "Command & Group" parent: "Click" nav_order: 1 --- # Chapter 1: Commands and Groups: The Building Blocks Welcome to your first step in learning Click! Imagine you want to create your own command-line tool, maybe something like `git` or `docker`. 
How do you tell your program what to do when someone types `git commit` or `docker build`? That's where **Commands** and **Groups** come in. They are the fundamental building blocks for any Click application.

Think about a simple tool. Maybe you want a program that can greet someone. You'd type `greet Alice` in your terminal, and it would print "Hello Alice!". In Click, this single action, "greet", would be represented by a `Command`.

Now, what if your tool needed to do *more* than one thing? Maybe besides greeting, it could also say goodbye. You might want to type `mytool greet Alice` or `mytool goodbye Bob`. The main `mytool` part acts like a container or a menu, holding the different actions (`greet`, `goodbye`). This container is what Click calls a `Group`.

So:

* `Command`: Represents a single action your tool can perform.
* `Group`: Represents a collection of related actions (Commands or other Groups).

Let's dive in and see how to create them!

## Your First Command

Creating a command in Click is surprisingly simple. You basically write a normal Python function and then "decorate" it to tell Click it's a command-line command.

Let's make a command that just prints "Hello World!".

```python
# hello_app.py
import click

@click.command()
def hello():
    """A simple command that says Hello World"""
    print("Hello World!")

if __name__ == '__main__':
    hello()
```

Let's break this down:

1. `import click`: We need to import the Click library first.
2. `@click.command()`: This is the magic part! It's called a decorator. It transforms the Python function `hello()` right below it into a Click `Command` object. We'll learn more about [Decorators](02_decorators.md) in the next chapter, but for now, just know this line turns `hello` into something Click understands as a command.
3. `def hello(): ...`: This is a standard Python function. The code inside this function is what will run when you execute the command from your terminal.
4.
   `"""A simple command that says Hello World"""`: This is a docstring. Click cleverly uses the function's docstring as the help text for the command!
5. `if __name__ == '__main__': hello()`: This standard Python construct checks if the script is being run directly. If it is, it calls our `hello` command function (which is now actually a Click `Command` object).

**Try running it!** Save the code above as `hello_app.py`. Open your terminal in the same directory and run:

```bash
$ python hello_app.py
Hello World!
```

It works! You just created your first command-line command with Click.

**Bonus: Automatic Help!**

Click automatically generates help screens for you. Try running your command with `--help`:

```bash
$ python hello_app.py --help
Usage: hello_app.py [OPTIONS]

  A simple command that says Hello World

Options:
  --help  Show this message and exit.
```

See? Click used the docstring we wrote (`A simple command that says Hello World`) and added a standard `--help` option for free!

## Grouping Commands

Okay, one command is nice, but real tools often have multiple commands. Like `git` has `commit`, `pull`, `push`, etc. Let's say we want our tool to have two commands: `hello` and `goodbye`.

We need a way to group these commands together. That's what `click.group()` is for. A `Group` acts as the main entry point and can have other commands attached to it.

```python
# multi_app.py
import click

# 1. Create the main group
@click.group()
def cli():
    """A simple tool with multiple commands."""
    pass # The group function itself doesn't need to do anything

# 2. Define the 'hello' command
@click.command()
def hello():
    """Says Hello World"""
    print("Hello World!")

# 3. Define the 'goodbye' command
@click.command()
def goodbye():
    """Says Goodbye World"""
    print("Goodbye World!")

# 4. Attach the commands to the group
cli.add_command(hello)
cli.add_command(goodbye)

if __name__ == '__main__':
    cli() # Run the main group
```

What's changed?

1.
   We created a function `cli` and decorated it with `@click.group()`. This makes `cli` our main entry point, a container for other commands. Notice the function body is just `pass` – often, the group function itself doesn't need logic; its job is to hold other commands.
2. We defined `hello` and `goodbye` just like before, using `@click.command()`.
3. Crucially, we *attached* our commands to the group: `cli.add_command(hello)` and `cli.add_command(goodbye)`. This tells Click that `hello` and `goodbye` are subcommands of `cli`.
4. Finally, in the `if __name__ == '__main__':` block, we run `cli()`, our main group.

**Let's run this!** Save it as `multi_app.py`.

First, check the main help screen:

```bash
$ python multi_app.py --help
Usage: multi_app.py [OPTIONS] COMMAND [ARGS]...

  A simple tool with multiple commands.

Options:
  --help  Show this message and exit.

Commands:
  goodbye  Says Goodbye World
  hello    Says Hello World
```

Look! Click now lists `goodbye` and `hello` under "Commands". It automatically figured out their names from the function names (`goodbye`, `hello`) and their help text from their docstrings.

Now, run the specific commands:

```bash
$ python multi_app.py hello
Hello World!

$ python multi_app.py goodbye
Goodbye World!
```

You've successfully created a multi-command CLI tool!

*(Sneak peek: There's an even shorter way to attach commands using decorators directly on the group, which we'll see in [Decorators](02_decorators.md)!)*

## How It Works Under the Hood

What's really happening when you use `@click.command()` or `@click.group()`?

1. **Decoration:** The decorator (`@click.command` or `@click.group`) takes your Python function (`hello`, `goodbye`, `cli`). It wraps this function inside a Click object – either a `Command` instance or a `Group` instance (which is actually a special type of `Command`). These objects store your original function as the `callback` to be executed later.
   They also store metadata like the command name (derived from the function name) and the help text (from the docstring). You can find the code for these decorators in `decorators.py` and the `Command`/`Group` classes in `core.py`.
2. **Execution:** When you run `python multi_app.py hello`, Python executes the `cli()` call at the bottom. Since `cli` is a `Group` object created by Click, it knows how to parse the command-line arguments (`hello` in this case).
3. **Parsing & Dispatch:** The `cli` group looks at the first argument (`hello`). It checks its list of registered subcommands (which we added using `cli.add_command`). It finds a match with the `hello` command object.
4. **Callback:** The `cli` group then invokes the `hello` command object. The `hello` command object, in turn, calls the original Python function (`hello()`) that it stored earlier as its `callback`.

Here's a simplified view of what happens when you run `python multi_app.py hello`:

```mermaid
sequenceDiagram
    participant User
    participant Terminal
    participant PythonScript as PythonScript (multi_app.py)
    participant ClickRuntime
    participant cli_Group as cli (Group Object)
    participant hello_Command as hello (Command Object)

    User->>Terminal: python multi_app.py hello
    Terminal->>PythonScript: Executes script with args ["hello"]
    PythonScript->>ClickRuntime: Calls cli() entry point
    ClickRuntime->>cli_Group: Asks to handle args ["hello"]
    cli_Group->>cli_Group: Parses args, identifies "hello" as subcommand
    cli_Group->>hello_Command: Invokes the 'hello' command
    hello_Command->>hello_Command: Executes its callback (the original hello() function)
    hello_Command-->>PythonScript: Prints "Hello World!"
    PythonScript-->>Terminal: Shows output
    Terminal-->>User: Displays "Hello World!"
```

This process of parsing arguments and calling the right function based on the command structure is the core job of Click, making it easy for *you* to just focus on writing the functions for each command.
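
The four steps above can be modeled in a few lines of plain Python. The sketch below is a toy, not Click's implementation: `ToyCommand` and `ToyGroup` are hypothetical names invented for illustration, and the real classes also handle options, help output, and error reporting. It only shows the core idea: a decorator replaces the function with an object that keeps the function as `callback`, and a group dispatches on the first argument.

```python
class ToyCommand:
    """Toy stand-in for a command: keeps the decorated function as `callback`."""

    def __init__(self, callback):
        self.callback = callback
        self.name = callback.__name__  # command name derived from the function name
        self.help = callback.__doc__   # help text taken from the docstring

    def main(self, args):
        # A plain command ignores extra args in this toy and runs its callback.
        return self.callback()


class ToyGroup(ToyCommand):
    """Toy stand-in for a group: a command that also holds subcommands."""

    def __init__(self, callback):
        super().__init__(callback)
        self.commands = {}

    def add_command(self, cmd):
        self.commands[cmd.name] = cmd

    def main(self, args):
        # Parse: the first argument names the subcommand to dispatch to.
        if not args or args[0] not in self.commands:
            known = ", ".join(sorted(self.commands))
            raise SystemExit(f"Usage: {self.name} COMMAND (one of: {known})")
        return self.commands[args[0]].main(args[1:])


@ToyGroup  # after this line, `cli` is a ToyGroup object, not a function
def cli():
    """A simple tool with multiple commands."""


@ToyCommand
def hello():
    """Says Hello World"""
    return "Hello World!"


@ToyCommand
def goodbye():
    """Says Goodbye World"""
    return "Goodbye World!"


cli.add_command(hello)
cli.add_command(goodbye)

print(cli.main(["hello"]))    # dispatches to hello's callback
print(cli.main(["goodbye"]))  # dispatches to goodbye's callback
```

Unlike this toy, a real Click `Group` is called directly (`cli()`) and reads the arguments from `sys.argv` itself, which is why the examples in this chapter never pass an argument list by hand.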
## Conclusion

You've learned about the two most fundamental concepts in Click:

* `Command`: Represents a single action, created by decorating a function with `@click.command()`.
* `Group`: Acts as a container for multiple commands (or other groups), created with `@click.group()`. Groups allow you to structure your CLI application logically.

We saw how Click uses decorators to transform simple Python functions into powerful command-line interface components, automatically handling things like help text generation and command dispatching.

Commands and Groups form the basic structure, but how do we pass information *into* our commands (like `git commit -m "My message"`)? And what other cool things can decorators do? We'll explore that starting with a deeper look at decorators in the next chapter!

Next up: [Chapter 2: Decorators](02_decorators.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Click/02_decorators.md
================================================
---
layout: default
title: "Decorators"
parent: "Click"
nav_order: 2
---

# Chapter 2: Decorators: Magic Wands for Your Functions

In [Chapter 1: Commands and Groups](01_command___group.md), we learned how to create basic command-line actions (`Command`) and group them together (`Group`). You might have noticed those strange `@click.command()` and `@click.group()` lines above our functions. What are they, and why do we use them?

Those are **Decorators**, and they are the heart of how you build Click applications! Think of them as special annotations or modifiers you place *on top* of your Python functions to give them command-line superpowers.

## Why Decorators? Making Life Easier

Imagine you didn't have decorators.
To create a simple command like `hello` from Chapter 1, you might have to write something like this (this is *not* how Click code is normally written, just an illustration):

```python
# NOT the usual Click style, but imagine...
import click

def hello_logic():
    """My command's help text"""
    print("Hello World!")

# Manually create a Command object
hello_command = click.Command(
    name='hello',              # Give it a name
    callback=hello_logic,      # Tell it which function to run
    help=hello_logic.__doc__   # Copy the help text
)

if __name__ == '__main__':
    # Manually parse arguments and run
    # (This part would be complex!)
    pass
```

That looks like a lot more work! You have to:

1. Write the function (`hello_logic`).
2. Manually create a `Command` object.
3. Explicitly tell the `Command` object its name, which function to run (`callback`), and its help text.

Now, let's remember the Click way from Chapter 1:

```python
# The actual Click way
import click

@click.command() # <-- The Decorator!
def hello():
    """A simple command that says Hello World"""
    print("Hello World!")

if __name__ == '__main__':
    hello()
```

Much cleaner, right? The `@click.command()` decorator handles creating the `Command` object, figuring out the name (`hello`), and grabbing the help text from the docstring (`"""..."""`) all automatically!

Decorators let you *declare* what you want ("this function is a command") right next to the function's code, making your CLI definition much more readable and concise.

## What is a Decorator in Python? (A Quick Peek)

Before diving deeper into Click's decorators, let's understand what a decorator *is* in Python itself.

In Python, a decorator is essentially a function that takes another function as input and returns a *modified* version of that function. It's like wrapping a gift: you still have the original gift inside, but the wrapping adds something extra.

The `@` symbol is just syntactic sugar – a shortcut – for applying a decorator.
Here's a super simple example (not using Click):

```python
# A simple Python decorator
def simple_decorator(func):
    def wrapper():
        print("Something is happening before the function is called.")
        func() # Call the original function
        print("Something is happening after the function is called.")
    return wrapper # Return the modified function

@simple_decorator # Apply the decorator
def say_whee():
    print("Whee!")

# Now, when we call say_whee...
say_whee()
```

Running this would print:

```
Something is happening before the function is called.
Whee!
Something is happening after the function is called.
```

See? `simple_decorator` took our `say_whee` function and wrapped it with extra print statements. The `@simple_decorator` line is equivalent to writing `say_whee = simple_decorator(say_whee)` after defining `say_whee`.

Click's decorators (`@click.command`, `@click.group`, etc.) do something similar, but instead of just printing, they wrap your function inside Click's `Command` or `Group` objects and configure them.

## Click's Main Decorators

Click provides several decorators. The most common ones you'll use are:

* `@click.command()`: Turns a function into a single CLI command.
* `@click.group()`: Turns a function into a container for other commands.
* `@click.option()`: Adds an *option* (like `--name` or `-v`) to your command. Options are typically optional parameters.
* `@click.argument()`: Adds an *argument* (like a required filename) to your command. Arguments are typically required and positional.

We already saw `@click.command` and `@click.group` in Chapter 1. Let's focus on how decorators streamline adding commands to groups and introduce options.

## Decorators in Action: Simplifying Groups and Adding Options

Remember the `multi_app.py` example from Chapter 1? We had to define the group `cli` and the commands `hello` and `goodbye` separately, then manually attach them using `cli.add_command()`.
```python
# multi_app_v1.py (from Chapter 1)
import click

@click.group()
def cli():
    """A simple tool with multiple commands."""
    pass

@click.command()
def hello():
    """Says Hello World"""
    print("Hello World!")

@click.command()
def goodbye():
    """Says Goodbye World"""
    print("Goodbye World!")

# Manual attachment
cli.add_command(hello)
cli.add_command(goodbye)

if __name__ == '__main__':
    cli()
```

Decorators provide a more elegant way! If you have a `@click.group()`, you can use *its* `.command()` method as a decorator to automatically attach the command.

Let's rewrite `multi_app.py` using this decorator pattern and also add a simple name option to the `hello` command using `@click.option`:

```python
# multi_app_v2.py (using decorators more effectively)
import click

# 1. Create the main group
@click.group()
def cli():
    """A simple tool with multiple commands."""
    pass # Group function still doesn't need to do much

# 2. Define 'hello' and attach it to 'cli' using a decorator
@cli.command() # <-- Decorator from the 'cli' group object!
@click.option('--name', default='World', show_default=True, help='Who to greet.')
def hello(name): # The 'name' parameter matches the option
    """Says Hello"""
    print(f"Hello {name}!")

# 3. Define 'goodbye' and attach it to 'cli' using a decorator
@cli.command() # <-- Decorator from the 'cli' group object!
def goodbye():
    """Says Goodbye"""
    print("Goodbye World!")

# No need for cli.add_command() anymore!

if __name__ == '__main__':
    cli()
```

What changed?

1. Instead of `@click.command()`, we used `@cli.command()` above `hello` and `goodbye`. This tells Click, "This function is a command, *and* it belongs to the `cli` group." No more manual `cli.add_command()` needed!
2. We added `@click.option('--name', default='World', show_default=True, help='Who to greet.')` right below `@cli.command()` for the `hello` function. This adds a command-line option named `--name`. Click only prints a default in the help text when asked, which is what `show_default=True` does.
3. The `hello` function now accepts an argument `name`.
   Click automatically passes the value provided via the `--name` option to this function parameter. If the user doesn't provide `--name`, it uses the `default='World'`.

**Let's run this new version:**

Check the help for the main command:

```bash
$ python multi_app_v2.py --help
Usage: multi_app_v2.py [OPTIONS] COMMAND [ARGS]...

  A simple tool with multiple commands.

Options:
  --help  Show this message and exit.

Commands:
  goodbye  Says Goodbye
  hello    Says Hello
```

Now check the help for the `hello` subcommand:

```bash
$ python multi_app_v2.py hello --help
Usage: multi_app_v2.py hello [OPTIONS]

  Says Hello

Options:
  --name TEXT  Who to greet.  [default: World]
  --help       Show this message and exit.
```

See? The `--name` option is listed, along with its help text and default value!

Finally, run `hello` with and without the option:

```bash
$ python multi_app_v2.py hello
Hello World!

$ python multi_app_v2.py hello --name Alice
Hello Alice!
```

It works! Decorators made adding the command to the group cleaner, and adding the option was as simple as adding another decorator line and a function parameter. We'll learn much more about configuring options and arguments in the next chapter, [Parameter (Option / Argument)](03_parameter__option___argument_.md).

## How Click Decorators Work (Under the Hood)

So what's the "magic" behind these `@` symbols in Click?

1. **Decorator Functions:** When you write `@click.command()` or `@click.option()`, you're calling functions defined in Click (specifically in `decorators.py`). These functions are designed to *return another function* (the actual decorator).
2. **Wrapping the User Function:** Python takes the function you defined (e.g., `hello`) and passes it to the decorator function returned in step 1.
3. **Attaching Information:**
    * `@click.option` / `@click.argument`: These decorators typically don't create the final `Command` object immediately.
      Instead, they attach the parameter information (like the option name `--name`, type, default value) to your function object itself, often using a special temporary attribute (like `__click_params__`). They then return the *original function*, but now with this extra metadata attached.
    * `@click.command` / `@click.group`: This decorator usually runs *last* (decorators are applied bottom-up). It looks for any parameter information attached by previous `@option` or `@argument` decorators (like `__click_params__`). It then creates the actual `Command` or `Group` object (defined in `core.py`), configures it with the command name, help text (from the docstring), the attached parameters, and stores your original function as the `callback` to be executed. It returns this newly created `Command` or `Group` object, effectively replacing your original function definition with the Click object.
4. **Group Attachment:** When you use `@cli.command()`, the `@cli.command()` decorator not only creates the `Command` object but also automatically calls `cli.add_command()` to register the new command with the `cli` group object.

Here's a simplified sequence diagram showing what happens when you define the `hello` command in `multi_app_v2.py`:

```mermaid
sequenceDiagram
    participant PythonInterpreter
    participant click_option as @click.option('--name')
    participant hello_func as hello(name)
    participant cli_command as @cli.command()
    participant cli_Group as cli (Group Object)
    participant hello_Command as hello (New Command Object)

    Note over PythonInterpreter, hello_func: Python processes decorators bottom-up
    PythonInterpreter->>click_option: Processes @click.option('--name', ...) decorator
    click_option->>hello_func: Attaches Option info (like in __click_params__)
    click_option-->>PythonInterpreter: Returns original hello_func (with attached info)

    PythonInterpreter->>cli_command: Processes @cli.command() decorator
    cli_command->>hello_func: Reads function name, docstring, attached params (__click_params__)
    cli_command->>hello_Command: Creates new Command object for 'hello'
    cli_command->>cli_Group: Calls cli.add_command(hello_Command)
    cli_command-->>PythonInterpreter: Returns the new hello_Command object

    Note over PythonInterpreter: 'hello' in the code now refers to the Command object
```

The key takeaway is that decorators allow Click to gather all the necessary information (function logic, command name, help text, options, arguments) right where you define the function, and build the corresponding Click objects behind the scenes.

You can find the implementation details in `click/decorators.py` and `click/core.py`. The `_param_memo` helper function in `decorators.py` is often used internally by `@option` and `@argument` to attach parameter info to the function before `@command` processes it.

## Conclusion

Decorators are fundamental to Click's design philosophy. They provide a clean, readable, and *declarative* way to turn your Python functions into powerful command-line interface components.

You've learned:

* Decorators are Python features (`@`) that modify functions.
* Click uses decorators like `@click.command`, `@click.group`, `@click.option`, and `@click.argument` extensively.
* Decorators handle the creation and configuration of `Command`, `Group`, `Option`, and `Argument` objects for you.
* Using decorators like `@group.command()` automatically attaches commands to groups.
* They make defining your CLI structure intuitive and keep related code together.

We've only scratched the surface of `@click.option` and `@click.argument`. How do you make options required? How do you handle different data types (numbers, files)?
How do you define arguments that take multiple values? We'll explore all of this in the next chapter!

Next up: [Chapter 3: Parameter (Option / Argument)](03_parameter__option___argument_.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Click/03_parameter__option___argument_.md
================================================
---
layout: default
title: "Parameter (Option & Argument)"
parent: "Click"
nav_order: 3
---

# Chapter 3: Parameter (Option / Argument) - Giving Your Commands Input

In the last chapter, [Decorators](02_decorators.md), we saw how decorators like `@click.command()` and `@click.option()` act like magic wands, transforming our Python functions into CLI commands and adding features like command-line options.

But how do our commands actually *receive* information from the user? If we have a command `greet`, how do we tell it *who* to greet, like `greet --name Alice`? Or if we have a `copy` command, how do we specify the source and destination files, like `copy report.txt backup.txt`?

This is where **Parameters** come in. Parameters define the inputs your commands can accept, just like arguments define the inputs for a regular Python function. Click handles parsing these inputs from the command line, validating them, and making them available to your command function.

There are two main types of parameters in Click:

1. **Options:** These are usually preceded by flags like `--verbose` or `-f`. They are often optional and can either take a value (like `--name Alice`) or act as simple on/off switches (like `--verbose`). You define them using the `@click.option()` decorator.
2. **Arguments:** These are typically positional values that come *after* any options. They often represent required inputs, like a filename (`report.txt`). You define them using the `@click.argument()` decorator.

Let's see how to use them!
## Options: The Named Inputs (`@click.option`)

Think of options like keyword arguments in Python functions. In `def greet(name="World"):`, `name` is a keyword argument with a default value. Options serve a similar purpose for your CLI.

Let's modify our `hello` command from the previous chapter to accept a `--name` option.

```python
# greet_app.py
import click

@click.group()
def cli():
    """A simple tool with a greeting command."""
    pass

@cli.command()
@click.option('--name', default='World', show_default=True, help='Who to greet.')
def hello(name): # <-- The 'name' parameter matches the option
    """Greets the person specified by the --name option."""
    print(f"Hello {name}!")

if __name__ == '__main__':
    cli()
```

Let's break down the new parts:

1. `@click.option('--name', default='World', show_default=True, help='Who to greet.')`: This decorator defines an option.
    * `'--name'`: This is the primary name of the option on the command line.
    * `default='World'`: If the user doesn't provide the `--name` option, the value `World` will be used.
    * `show_default=True`: Displays the default value in the help message. Click leaves defaults out of the help text unless you ask for them.
    * `help='Who to greet.'`: This text will appear in the help message for the `hello` command.
2. `def hello(name):`: Notice how the `hello` function now accepts an argument named `name`. Click cleverly matches the option name (`name`) to the function parameter name and passes the value automatically!

**Try running it!**

First, check the help message for the `hello` command:

```bash
$ python greet_app.py hello --help
Usage: greet_app.py hello [OPTIONS]

  Greets the person specified by the --name option.

Options:
  --name TEXT  Who to greet.  [default: World]
  --help       Show this message and exit.
```

See? Click added our `--name` option to the help screen, including the help text and default value we provided. The `TEXT` part indicates the type of value expected (we'll cover types in [ParamType](04_paramtype.md)).

Now, run it with and without the option:

```bash
$ python greet_app.py hello
Hello World!

$ python greet_app.py hello --name Alice
Hello Alice!
```

It works perfectly!
Click parsed the `--name Alice` option and passed `"Alice"` to our `hello` function's `name` parameter. When we didn't provide the option, it used the default value `"World"`.

### Option Flavors: Short Names and Flags

Options can have variations:

* **Short Names:** You can provide shorter aliases, like `-n` for `--name`.
* **Flags:** Options that don't take a value but act as switches (e.g., `--verbose`).

Let's add a short name `-n` to our `--name` option and a `--shout` flag to make the greeting uppercase.

```python
# greet_app_v2.py
import click

@click.group()
def cli():
    """A simple tool with a greeting command."""
    pass

@cli.command()
@click.option('--name', '-n', default='World', help='Who to greet.') # Added '-n'
@click.option('--shout', is_flag=True, help='Greet loudly.')         # Added '--shout' flag
def hello(name, shout): # <-- Function now accepts 'shout' too
    """Greets the person, optionally shouting."""
    greeting = f"Hello {name}!"
    if shout:
        greeting = greeting.upper()
    print(greeting)

if __name__ == '__main__':
    cli()
```

Changes:

1. `@click.option('--name', '-n', ...)`: We added `'-n'` as the second argument to the decorator. Now, both `--name` and `-n` work.
2. `@click.option('--shout', is_flag=True, ...)`: This defines a flag. `is_flag=True` tells Click this option doesn't take a value; its presence makes the corresponding parameter `True`, otherwise it's `False`.
3. `def hello(name, shout):`: The function signature is updated to accept the `shout` parameter.

**Run it again!**

```bash
$ python greet_app_v2.py hello -n Bob
Hello Bob!

$ python greet_app_v2.py hello --name Carol --shout
HELLO CAROL!

$ python greet_app_v2.py hello --shout
HELLO WORLD!
```

Flags and short names make your CLI more flexible and conventional!

## Arguments: The Positional Inputs (`@click.argument`)

Arguments are like positional arguments in Python functions. In `def copy(src, dst):`, `src` and `dst` are required positional arguments.
Click arguments usually represent mandatory inputs that follow the command and any options.

Let's create a simple command that takes two arguments, `SRC` and `DST`, representing source and destination files (though we'll just print them for now).

```python
# copy_app.py
import click

@click.command()
@click.argument('src') # Defines the first argument
@click.argument('dst') # Defines the second argument
def copy(src, dst): # Function parameters match argument names
    """Copies SRC file to DST."""
    print(f"Pretending to copy '{src}' to '{dst}'")

if __name__ == '__main__':
    copy()
```

What's happening here?

1. `@click.argument('src')`: Defines a positional argument named `src`. By default, arguments are required. The name `'src'` is used internally as the parameter name and, by convention, is shown capitalized (`SRC`) in usage and error messages.
2. `@click.argument('dst')`: Defines the second required positional argument.
3. `def copy(src, dst):`: The function parameters `src` and `dst` receive the values provided on the command line in the order they appear.

**Let's try it!**

First, see what happens if we forget the arguments:

```bash
$ python copy_app.py
Usage: copy_app.py [OPTIONS] SRC DST
Try 'copy_app.py --help' for help.

Error: Missing argument 'SRC'.
```

Click automatically detects the missing argument and gives a helpful error message!

Now, provide the arguments:

```bash
$ python copy_app.py report.txt backup/report.txt
Pretending to copy 'report.txt' to 'backup/report.txt'
```

Click correctly captured the positional arguments and passed them to our `copy` function.

Arguments are essential for inputs that are fundamental to the command's operation, like the files to operate on. Options are better suited for modifying the command's behavior.
*(Note: Arguments can also be made optional or accept variable numbers of inputs, often involving the `required` and `nargs` settings, which tie into concepts we'll explore more in [ParamType](04_paramtype.md).)*

## How Parameters Work Together

When you run a command like `python greet_app_v2.py hello --shout -n Alice`, Click performs a sequence of steps:

1. **Parsing:** Click looks at the command-line arguments (`sys.argv`) provided by the operating system: `['greet_app_v2.py', 'hello', '--shout', '-n', 'Alice']`.
2. **Command Identification:** It identifies `hello` as the command to execute.
3. **Parameter Matching:** It scans the remaining arguments (`['--shout', '-n', 'Alice']`).
    * It sees `--shout`. It looks up the parameters defined for the `hello` command (using the `@click.option` and `@click.argument` decorators). It finds the `shout` option definition (which has `is_flag=True`). It marks the value for `shout` as `True`.
    * It sees `-n`. It finds the `name` option definition (which includes `-n` as an alias and expects a value).
    * It sees `Alice`. Since the previous token (`-n`) expected a value, Click associates `"Alice"` with the `-n` (and thus `--name`) option. It marks the value for `name` as `"Alice"`.
4. **Validation & Conversion:** Click checks if all required parameters are present (they are). It also performs type conversion (though in this case, the default is string, which matches "Alice"). We'll see more complex conversions in the next chapter.
5. **Function Call:** Finally, Click calls the command's underlying Python function (`hello`) with the collected values as keyword arguments: `hello(name='Alice', shout=True)`.
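
The matching in steps 3 to 5 can be sketched with a small hand-written loop. This is a toy model under simplifying assumptions, not Click's actual parser: `toy_parse`, `HELLO_OPTIONS`, and `DEFAULTS` are hypothetical names made up for this illustration, and the sketch ignores positional arguments, `--name=Alice` syntax, combined short flags, and type conversion.

```python
# Hypothetical option table: flag spelling -> (function parameter, is_flag)
HELLO_OPTIONS = {
    "--name": ("name", False),
    "-n": ("name", False),      # short alias for --name
    "--shout": ("shout", True),
}
DEFAULTS = {"name": "World", "shout": False}


def toy_parse(tokens, options):
    """Walk the tokens left to right and collect parameter values."""
    values = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token not in options:
            raise ValueError(f"No such option: {token}")
        dest, is_flag = options[token]
        if is_flag:
            values[dest] = True           # presence of a flag means True
            i += 1
        else:
            if i + 1 >= len(tokens):
                raise ValueError(f"Option {token} requires a value")
            values[dest] = tokens[i + 1]  # the next token is the option's value
            i += 2
    return values


def hello(name, shout):
    greeting = f"Hello {name}!"
    return greeting.upper() if shout else greeting


def run(argv):
    command, *rest = argv
    if command != "hello":
        raise ValueError(f"No such command: {command}")
    # Defaults first, then whatever the user supplied on the command line.
    params = {**DEFAULTS, **toy_parse(rest, HELLO_OPTIONS)}
    return hello(**params)  # called with keyword arguments, as in step 5


print(run(["hello", "--shout", "-n", "Alice"]))  # HELLO ALICE!
print(run(["hello"]))                            # Hello World!
```

The point of the table-driven design is that the loop knows nothing about `hello` itself; everything it needs comes from the option definitions, which is the role the `Option` objects play for the real parser.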
Here's a simplified view of the process:

```mermaid
sequenceDiagram
    participant User
    participant Terminal
    participant PythonScript as python greet_app_v2.py
    participant ClickRuntime
    participant hello_func as hello(name, shout)

    User->>Terminal: python greet_app_v2.py hello --shout -n Alice
    Terminal->>PythonScript: Executes script with args ["hello", "--shout", "-n", "Alice"]
    PythonScript->>ClickRuntime: Calls cli() entry point
    ClickRuntime->>ClickRuntime: Parses args, finds 'hello' command
    ClickRuntime->>ClickRuntime: Identifies '--shout' as flag for 'shout' parameter (value=True)
    ClickRuntime->>ClickRuntime: Identifies '-n' as option for 'name' parameter
    ClickRuntime->>ClickRuntime: Consumes 'Alice' as value for '-n'/'name' parameter (value="Alice")
    ClickRuntime->>ClickRuntime: Validates parameters, performs type conversion
    ClickRuntime->>hello_func: Calls callback: hello(name="Alice", shout=True)
    hello_func-->>PythonScript: Prints "HELLO ALICE!"
    PythonScript-->>Terminal: Shows output
    Terminal-->>User: Displays "HELLO ALICE!"
```

## Under the Hood: Decorators and Parameter Objects

How do `@click.option` and `@click.argument` actually work with `@click.command`?

1. **Parameter Definition (`decorators.py`, `core.py`):** When you use `@click.option(...)` or `@click.argument(...)`, these functions (defined in `click/decorators.py`) create instances of the `Option` or `Argument` classes (defined in `click/core.py`). These objects store all the configuration you provided (like `--name`, `-n`, `default='World'`, `is_flag=True`, etc.).
2. **Attaching to Function (`decorators.py`):** Crucially, these decorators don't immediately add the parameters to a command. Instead, they attach the created `Option` or `Argument` object to the function they are decorating. Click uses a helper mechanism (like the internal `_param_memo` function which adds to a `__click_params__` list) to store these parameter objects *on* the function object temporarily.
3.
**Command Creation (`decorators.py`, `core.py`):** The `@click.command()` decorator (or `@group.command()`) runs *after* all the `@option` and `@argument` decorators for that function. It looks for the attached parameter objects (the `__click_params__` list). It gathers these objects and passes them to the constructor of the `Command` (or `Group`) object it creates. The `Command` object stores these parameters in its `params` attribute. 4. **Parsing (`parser.py`, `core.py`):** When the command is invoked, the `Command` object uses its `params` list to configure an internal parser (historically based on Python's `optparse`, see `click/parser.py`). This parser processes the command-line string (`sys.argv`) according to the rules defined by the `Option` and `Argument` objects in the `params` list. 5. **Callback Invocation (`core.py`):** After parsing and validation, Click takes the resulting values and calls the original Python function (stored as the `Command.callback`), passing the values as arguments. So, the decorators work together: `@option`/`@argument` define the parameters and temporarily attach them to the function, while `@command` collects these definitions and builds the final `Command` object, ready for parsing. ## Conclusion You've learned how to make your Click commands interactive by defining inputs using **Parameters**: * **Options (`@click.option`):** Named inputs, often optional, specified with flags (`--name`, `-n`). Great for controlling behavior (like `--verbose`, `--shout`) or providing specific pieces of data (`--output file.txt`). * **Arguments (`@click.argument`):** Positional inputs, often required, that follow options (`input.csv`). Ideal for core data the command operates on (like source/destination files). You saw how Click uses decorators to define these parameters and automatically handles parsing the command line, providing default values, generating help messages, and passing the final values to your Python function. 
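
The decorator hand-off described in "Under the Hood" can be observed by applying the decorators manually. This is only a sketch: `__click_params__` is an internal attribute that may change between Click versions, and the `hello` function here is a made-up stand-in.

```python
import click


def hello(name, shout):
    """Plain function; no Click machinery attached yet."""
    click.echo(f"Hello {name}!")


# Apply the decorators by hand, bottom-up, as the @ syntax would.
hello = click.option("--shout", is_flag=True)(hello)
hello = click.option("--name", "-n", default="World")(hello)

# Still a plain function, but the Option objects are now parked on it.
staged = [f"{type(p).__name__}:{p.name}" for p in hello.__click_params__]
print(staged)

# click.command() collects the parked parameters into a Command object.
cmd = click.command()(hello)
print(type(cmd).__name__, sorted(p.name for p in cmd.params))
```

The first `print` shows the parameters while they are only attached to the function; the second shows them after `click.command()` has moved them into `Command.params`.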
But what if you want an option to accept only numbers? Or a choice from a predefined list? Or maybe an argument that represents a file path that must exist? Click handles this through **Parameter Types**. Let's explore those next! Next up: [Chapter 4: ParamType](04_paramtype.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Click/04_paramtype.md ================================================ --- layout: default title: "ParamType" parent: "Click" nav_order: 4 --- # Chapter 4: ParamType - Checking and Converting Inputs In [Chapter 3: Parameter (Option / Argument)](03_parameter__option___argument_.md), we learned how to define inputs for our commands using `@click.option` and `@click.argument`. Our `greet` command could take a `--name` option, and our `copy` command took `SRC` and `DST` arguments. But what if we need more control? What if our command needs a *number* as input, like `--count 3`? Or what if an option should only accept specific words, like `--level easy` or `--level hard`? Right now, Click treats most inputs as simple text strings. This is where **ParamType** comes in! Think of `ParamType`s as the **gatekeepers** and **translators** for your command-line inputs. They: 1. **Validate:** Check if the user's input looks correct (e.g., "Is this actually a number?"). 2. **Convert:** Change the input text (which is always initially a string) into the Python type you need (e.g., the string `"3"` becomes the integer `3`). `ParamType`s make your commands more robust by catching errors early and giving your Python code the data types it expects. ## Why Do We Need ParamTypes? Imagine you're writing a command to repeat a message multiple times: ```bash repeat --times 5 "Hello!" ``` Inside your Python function, you want the `times` variable to be an integer so you can use it in a loop. 
If the user types `repeat --times five "Hello!"`, your code might crash if it tries to use the string `"five"` like a number. `ParamType` solves this. By telling Click that the `--times` option expects an integer, Click will automatically: * Check if the input (`"5"`) can be turned into an integer. * If yes, convert it to the integer `5` and pass it to your function. * If no (like `"five"`), stop immediately and show the user a helpful error message *before* your function even runs! ## Using Built-in ParamTypes Click provides several ready-to-use `ParamType`s. You specify which one to use with the `type` argument in `@click.option` or `@click.argument`. Let's modify an example to use `click.INT`. ```python # count_app.py import click @click.command() @click.option('--count', default=1, type=click.INT, help='Number of times to print.') @click.argument('message') def repeat(count, message): """Prints MESSAGE the specified number of times.""" # 'count' is now guaranteed to be an integer! for _ in range(count): click.echo(message) if __name__ == '__main__': repeat() ``` Breakdown: 1. `import click`: As always. 2. `@click.option('--count', ..., type=click.INT, ...)`: This is the key change! We added `type=click.INT`. This tells Click that the value provided for `--count` must be convertible to an integer. `click.INT` is one of Click's built-in `ParamType` instances. 3. `def repeat(count, message):`: The `count` parameter in our function will receive the *converted* integer value. **Let's run it!** ```bash $ python count_app.py --count 3 "Woohoo!" Woohoo! Woohoo! Woohoo! ``` It works! Click converted the input string `"3"` into the Python integer `3` before calling our `repeat` function. Now, see what happens with invalid input: ```bash $ python count_app.py --count five "Oh no" Usage: count_app.py [OPTIONS] MESSAGE Try 'count_app.py --help' for help. Error: Invalid value for '--count': 'five' is not a valid integer. ``` Perfect! 
Click caught the error because `"five"` couldn't be converted by `click.INT`. It printed a helpful message and prevented our `repeat` function from running with bad data. ## Common Built-in Types Click offers several useful built-in types: * `click.STRING`: The default type. Converts the input to a string (usually doesn't change much unless the input was bytes). * `click.INT`: Converts to an integer. Fails if the input isn't a valid whole number. * `click.FLOAT`: Converts to a floating-point number. Fails if the input isn't a valid number (e.g., `3.14`, `-0.5`). * `click.BOOL`: Converts to a boolean (`True`/`False`). It's clever and understands inputs like `'1'`, `'true'`, `'t'`, `'yes'`, `'y'`, `'on'` as `True`, and `'0'`, `'false'`, `'f'`, `'no'`, `'n'`, `'off'` as `False`. Usually used for options that aren't flags. * `click.Choice`: Checks if the value is one of a predefined list of choices. ```python # choice_example.py import click @click.command() @click.option('--difficulty', type=click.Choice(['easy', 'medium', 'hard'], case_sensitive=False), default='easy') def setup(difficulty): click.echo(f"Setting up game with difficulty: {difficulty}") if __name__ == '__main__': setup() ``` Running `python choice_example.py --difficulty MeDiUm` works (because `case_sensitive=False`), but `python choice_example.py --difficulty expert` would fail. * `click.Path`: Represents a filesystem path. It can check if the path exists, if it's a file or directory, and if it has certain permissions (read/write/execute). It returns the path as a string (or `pathlib.Path` if configured). ```python # path_example.py import click @click.command() @click.argument('output_dir', type=click.Path(exists=True, file_okay=False, dir_okay=True, writable=True)) def process(output_dir): click.echo(f"Processing data into directory: {output_dir}") # We know output_dir exists, is a directory, and is writable! 
if __name__ == '__main__': process() ``` * `click.File`: Similar to `Path`, but it *automatically opens* the file and passes the open file object to your function. It also handles closing the file automatically. You can specify the mode (`'r'`, `'w'`, `'rb'`, `'wb'`). ```python # file_example.py import click @click.command() @click.argument('input_file', type=click.File('r')) # Open for reading text def cat(input_file): # input_file is an open file handle! click.echo(input_file.read()) # Click will close the file automatically after this function returns if __name__ == '__main__': cat() ``` These built-in types cover most common use cases for validating and converting command-line inputs. ## How ParamTypes Work Under the Hood What happens when you specify `type=click.INT`? 1. **Parsing:** As described in [Chapter 3](03_parameter__option___argument_.md), Click's parser identifies the command-line arguments and matches them to your defined `Option`s and `Argument`s. It finds the raw string value provided by the user (e.g., `"3"` for `--count`). 2. **Type Retrieval:** The parser looks at the `Parameter` object (the `Option` or `Argument`) and finds the `type` you assigned to it (e.g., the `click.INT` instance). 3. **Conversion Attempt:** The parser calls the `convert()` method of the `ParamType` instance, passing the raw string value (`"3"`), the parameter object itself, and the current [Context](05_context.md). 4. **Validation & Conversion Logic (Inside `ParamType.convert`)**: * The `click.INT.convert()` method tries to call Python's built-in `int("3")`. * If this succeeds, it returns the result (the integer `3`). * If it fails (e.g., `int("five")` would raise a `ValueError`), the `convert()` method catches this error. 5. **Success or Failure**: * **Success:** The parser receives the converted value (`3`) and stores it. Later, it passes this value to your command function. * **Failure:** The `convert()` method calls its `fail()` helper method. 
The `fail()` method raises a `click.BadParameter` exception with a helpful error message (e.g., "'five' is not a valid integer."). Click catches this exception, stops further processing, and displays the error message to the user along with usage instructions. Here's a simplified view of the successful conversion process: ```mermaid sequenceDiagram participant User participant CLI participant ClickParser as Click Parser participant IntType as click.INT participant CommandFunc as Command Function User->>CLI: python count_app.py --count 3 ... CLI->>ClickParser: Parse args, find '--count' option with value '3' ClickParser->>IntType: Call convert(value='3', param=..., ctx=...) IntType->>IntType: Attempt int('3') -> Success! returns 3 IntType-->>ClickParser: Return converted value: 3 ClickParser->>CommandFunc: Call repeat(count=3, ...) CommandFunc-->>CLI: Executes logic (prints message 3 times) ``` And here's the failure process: ```mermaid sequenceDiagram participant User participant CLI participant ClickParser as Click Parser participant IntType as click.INT participant ClickException as Click Exception Handling User->>CLI: python count_app.py --count five ... CLI->>ClickParser: Parse args, find '--count' option with value 'five' ClickParser->>IntType: Call convert(value='five', param=..., ctx=...) IntType->>IntType: Attempt int('five') -> Fails! (ValueError) IntType->>ClickException: Catch error, call fail("'five' is not...") -> raises BadParameter ClickException-->>ClickParser: BadParameter exception raised ClickParser-->>CLI: Catch exception, stop processing CLI-->>User: Display "Error: Invalid value for '--count': 'five' is not a valid integer." ``` The core logic for built-in types resides in `click/types.py`. Each type (like `IntParamType`, `Choice`, `Path`) inherits from the base `ParamType` class and implements its own `convert` method containing the specific validation and conversion rules. 
```python # Simplified structure from click/types.py class ParamType: name: str # Human-readable name like "integer" or "filename" def convert(self, value, param, ctx): # Must be implemented by subclasses # Should return the converted value or call self.fail() raise NotImplementedError def fail(self, message, param, ctx): # Raises a BadParameter exception raise BadParameter(message, ctx=ctx, param=param) class IntParamType(ParamType): name = "integer" def convert(self, value, param, ctx): try: # The core conversion logic! return int(value) except ValueError: # If conversion fails, raise the standard error self.fail(f"{value!r} is not a valid integer.", param, ctx) # click.INT is just an instance of this class INT = IntParamType() ``` ## Custom Types What if none of the built-in types do exactly what you need? Click allows you to create your own custom `ParamType`s! You can do this by subclassing `click.ParamType` and implementing the `name` attribute and the `convert` method. This is an advanced topic, but it provides great flexibility. ## Shell Completion Hints An added benefit of using specific `ParamType`s is that they can provide hints for shell completion (when the user presses Tab). For example: * `click.Choice(['easy', 'medium', 'hard'])` can suggest `easy`, `medium`, or `hard`. * `click.Path` can suggest file and directory names from the current location. This makes your CLI even more user-friendly. ## Conclusion `ParamType`s are a fundamental part of Click, acting as the bridge between raw command-line text input and the well-typed data your Python functions need. They handle the crucial tasks of: * **Validating** user input against expected formats or rules. * **Converting** input strings to appropriate Python types (integers, booleans, files, etc.). * **Generating** user-friendly error messages for invalid input. * Providing hints for **shell completion**. 
By using built-in types like `click.INT`, `click.Choice`, `click.Path`, and `click.File`, you make your commands more robust, reliable, and easier to use. So far, we've seen how commands are structured, how parameters get their values, and how those values are validated and converted. But how does Click manage the state *during* the execution of a command? How does it know which command is running or what the parent commands were? That's the job of the `Context`. Let's explore that next! Next up: [Chapter 5: Context](05_context.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Click/05_context.md ================================================ --- layout: default title: "Context" parent: "Click" nav_order: 5 --- # Chapter 5: Context - The Command's Nervous System In the last chapter, [ParamType](04_paramtype.md), we saw how Click helps validate and convert user input into the right Python types, making our commands more robust. We used types like `click.INT` and `click.Path` to ensure data correctness. But what happens *while* a command is running? How does Click keep track of which command is being executed, what parameters were passed, or even shared information between different commands in a nested structure (like `git remote add ...`)? This is where the **Context** object, often referred to as `ctx`, comes into play. Think of the Context as the central nervous system for a single command invocation. It carries all the vital information about the current state of execution. ## Why Do We Need a Context? Imagine you have a command that needs to behave differently based on a global configuration, maybe a `--verbose` flag set on the main application group. Or perhaps one command needs to call another command within the same application. How do they communicate? 
The Context object solves these problems by providing a central place to: * Access parameters passed to the *current* command. * Access parameters or settings from *parent* commands. * Share application-level objects (like configuration settings or database connections) between commands. * Manage resources that need cleanup (like automatically closing files opened with `click.File`). * Invoke other commands programmatically. Let's explore how to access and use this powerful object. ## Getting the Context: `@pass_context` Click doesn't automatically pass the Context object to your command function. You need to explicitly ask for it using a special decorator: `@click.pass_context`. When you add `@click.pass_context` *above* your function definition (but typically *below* the `@click.command` or `@click.option` decorators), Click will automatically **inject** the `Context` object as the **very first argument** to your function. Let's see a simple example: ```python # context_basics.py import click @click.group() @click.pass_context # Request the context for the group function def cli(ctx): """A simple CLI with context.""" # We can store arbitrary data on the context's 'obj' attribute ctx.obj = {'verbose': False} # Initialize a shared dictionary @cli.command() @click.option('--verbose', is_flag=True, help='Enable verbose mode.') @click.pass_context # Request the context for the command function def info(ctx, verbose): """Prints info, possibly verbosely.""" # Access the command name from the context click.echo(f"Executing command: {ctx.command.name}") # Access parameters passed to *this* command click.echo(f"Verbose flag (local): {verbose}") # We can modify the shared object from the parent context if verbose: ctx.obj['verbose'] = True # Access the shared object from the parent context click.echo(f"Verbose setting (shared): {ctx.obj['verbose']}") if __name__ == '__main__': cli() ``` Let's break it down: 1. 
`@click.pass_context`: We apply this decorator to both the `cli` group function and the `info` command function. 2. `def cli(ctx): ...`: Because of `@pass_context`, the `cli` function now receives the `Context` object as its first argument, which we've named `ctx`. 3. `ctx.obj = {'verbose': False}`: The `ctx.obj` attribute is a special place designed for you to store and share *your own* application data. Here, the main `cli` group initializes it as a dictionary. This object will be automatically inherited by child command contexts. 4. `def info(ctx, verbose): ...`: The `info` command function also receives the `Context` (`ctx`) as its first argument, followed by its own parameters (`verbose`). 5. `ctx.command.name`: We access the `Command` object associated with the current context via `ctx.command` and get its name. 6. `ctx.obj['verbose'] = True`: We can *modify* the shared `ctx.obj` from within the subcommand. 7. `click.echo(f"Verbose setting (shared): {ctx.obj['verbose']}")`: We access the potentially modified shared state. **Run it!** ```bash $ python context_basics.py info Executing command: info Verbose flag (local): False Verbose setting (shared): False $ python context_basics.py info --verbose Executing command: info Verbose flag (local): True Verbose setting (shared): True ``` You can see how `@pass_context` gives us access to the runtime environment (`ctx.command.name`) and allows us to use `ctx.obj` to share state between the parent group (`cli`) and the subcommand (`info`). ## Key Context Attributes The `Context` object has several useful attributes: * `ctx.command`: The [Command](01_command___group.md) object that this context belongs to. You can get its name (`ctx.command.name`), parameters, etc. * `ctx.parent`: The context of the invoking command. If this is the top-level command, `ctx.parent` will be `None`. This forms a linked list or chain back to the root context. 
* `ctx.params`: A dictionary mapping parameter names to the *final* values passed to the command, after parsing, type conversion, and defaults have been applied.

    ```python
    # access_params.py
    import click

    @click.command()
    @click.option('--name', default='Guest')
    @click.pass_context
    def hello(ctx, name):
        click.echo(f"Hello, {name}!")
        # Access the parameter value directly via ctx.params
        click.echo(f"(Value from ctx.params: {ctx.params['name']})")

    if __name__ == '__main__':
        hello()
    ```

    Running `python access_params.py --name Alice` would show `Hello, Alice!` and `(Value from ctx.params: Alice)`.

* `ctx.obj`: As seen before, this is an arbitrary object that gets passed down the context chain. It's commonly used for shared configuration, database connections, or other application-level state. You can also use `@click.pass_obj` as a shortcut if you *only* need `ctx.obj`.
* `ctx.info_name`: The name that was used on the command line to invoke this command or group (e.g., `info` in `python context_basics.py info`).
* `ctx.invoked_subcommand`: For groups, this holds the name of the subcommand that was invoked (or `None` if no subcommand was called).

## Calling Other Commands

Sometimes, you want one command to trigger another. The Context provides methods for this:

* `ctx.invoke(other_command, **params)`: Calls another Click command (`other_command`) in a new child context whose parent is the current context (`ctx`). It uses the provided `params` for the call; any parameters you omit fall back to that command's defaults.
* `ctx.forward(other_command)`: Similar to `invoke`, but it automatically passes all parameters from the *current* context (`ctx.params`) to the `other_command`. This is useful for creating alias commands.
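
The difference between the two methods is easiest to see side by side. A minimal sketch, with `show` and `alias` as made-up command names: `alias` first forwards its own parameters to `show`, then invokes `show` again with an explicit value.

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    pass


@cli.command()
@click.option("--count", default=1)
def show(count):
    click.echo(f"Count: {count}")


@cli.command()
@click.option("--count", default=1)
@click.pass_context
def alias(ctx, count):
    ctx.forward(show)            # re-uses this command's own ctx.params
    ctx.invoke(show, count=42)   # passes explicit params instead


result = CliRunner().invoke(cli, ["alias", "--count", "3"])
print(result.output)
# Count: 3
# Count: 42
```

`forward` only makes sense when both commands accept the same parameter names, which is why it suits aliases.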
```python # invoke_example.py import click @click.group() def cli(): pass @cli.command() @click.argument('text') def print_it(text): """Prints the given text.""" click.echo(f"Printing: {text}") @cli.command() @click.argument('message') @click.pass_context # Need context to call invoke def shout(ctx, message): """Shouts the message by calling print_it.""" click.echo("About to invoke print_it...") # Call the 'print_it' command, passing the uppercased message ctx.invoke(print_it, text=message.upper()) click.echo("Finished invoking print_it.") if __name__ == '__main__': cli() ``` Running `python invoke_example.py shout "hello world"` will output: ``` About to invoke print_it... Printing: HELLO WORLD Finished invoking print_it. ``` The `shout` command successfully called the `print_it` command programmatically using `ctx.invoke()`. ## Resource Management (`ctx.call_on_close`) Click uses the context internally to manage resources. For instance, when you use `type=click.File('w')`, Click opens the file and registers a cleanup function using `ctx.call_on_close(file.close)`. This ensures the file is closed when the context is finished, even if errors occur. You can use this mechanism yourself if you need custom resource cleanup tied to the command's lifecycle. ```python # resource_management.py import click class MockResource: def __init__(self, name): self.name = name click.echo(f"Resource '{self.name}' opened.") def close(self): click.echo(f"Resource '{self.name}' closed.") @click.command() @click.pass_context def process(ctx): """Opens and closes a mock resource.""" res = MockResource("DataFile") # Register the close method to be called when the context ends ctx.call_on_close(res.close) click.echo("Processing with resource...") # Function ends, context tears down, call_on_close triggers if __name__ == '__main__': process() ``` Running this script will show: ``` Resource 'DataFile' opened. Processing with resource... Resource 'DataFile' closed. 
```

The resource was automatically closed because we registered its `close` method with `ctx.call_on_close`.

## How Context Works Under the Hood

1. **Initial Context:** When you run your Click application (e.g., by calling `cli()`), Click creates the first `Context` object associated with the top-level command or group (`cli` in our examples).
2. **Parsing and Subcommand:** Click parses the command-line arguments. If a subcommand is identified (like `info` in `python context_basics.py info`), Click finds the corresponding `Command` object.
3. **Child Context Creation:** Before executing the subcommand's callback function, Click creates a *new* `Context` object for the subcommand. Crucially, it sets the `parent` attribute of this new context to the context of the invoking command (the `cli` context in our example).
4. **Object Inheritance:** The `ctx.obj` attribute is automatically passed down from the parent context to the child context *by reference* (unless the child explicitly sets its own `ctx.obj`).
5. **`@pass_context` Decorator:** This decorator (defined in `decorators.py`) wraps your callback function. When the wrapped function is called, the decorator uses `click.globals.get_current_context()` (which accesses a thread-local stack of contexts) to fetch the *currently active* context and inserts it as the first argument before calling your original function.
6. **`ctx.invoke`:** When you call `ctx.invoke(other_cmd, ...)`, Click creates a *new* context for `other_cmd` (setting its parent to the calling context, `ctx`), populates its `params` from the arguments you provided plus defaults for any you omitted, and then executes `other_cmd`'s callback within that new context.
7. **Cleanup:** Once a command function finishes (or raises an exception that Click handles), its corresponding context is "torn down". This is when any functions registered with `ctx.call_on_close` are executed.
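
The parent link, `ctx.obj` inheritance, and teardown order from the list above can be checked with a small experiment. This is a sketch with a made-up `events` list used purely as a recorder; it is not part of Click.

```python
import click
from click.testing import CliRunner

events = []  # records what happens, in order


@click.group()
@click.pass_context
def cli(ctx):
    ctx.obj = {"verbose": False}
    ctx.call_on_close(lambda: events.append("cli context closed"))


@cli.command()
@click.pass_context
def info(ctx):
    # The child context points back at the group's context...
    events.append(f"parent is {ctx.parent.command.name}")
    # ...and shares the very same obj (passed by reference).
    events.append(f"same obj: {ctx.obj is ctx.parent.obj}")
    # Walking up the chain ends at a context with no parent.
    events.append(f"root has no parent: {ctx.find_root().parent is None}")
    ctx.call_on_close(lambda: events.append("info context closed"))


CliRunner().invoke(cli, ["info"])
for line in events:
    print(line)
```

The child context is torn down before its parent, so the `info` cleanup callback runs first.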
Here's a simplified diagram showing context creation and `ctx.obj` flow for `python context_basics.py info --verbose`: ```mermaid sequenceDiagram participant User participant CLI as python context_basics.py participant ClickRuntime participant cli_ctx as cli Context participant info_ctx as info Context participant cli_func as cli(ctx) participant info_func as info(ctx, verbose) User->>CLI: info --verbose CLI->>ClickRuntime: Calls cli() entry point ClickRuntime->>cli_ctx: Creates root context for 'cli' group Note over ClickRuntime, cli_func: ClickRuntime calls cli's callback (due to @click.group) ClickRuntime->>cli_func: cli(ctx=cli_ctx) cli_func->>cli_ctx: Sets ctx.obj = {'verbose': False} cli_func-->>ClickRuntime: Returns ClickRuntime->>ClickRuntime: Parses args, finds 'info' subcommand, '--verbose' option ClickRuntime->>info_ctx: Creates child context for 'info' command info_ctx->>cli_ctx: Sets info_ctx.parent = cli_ctx info_ctx->>info_ctx: Inherits ctx.obj from parent (value = {'verbose': False}) Note over ClickRuntime, info_func: ClickRuntime prepares to call info's callback ClickRuntime->>ClickRuntime: Uses @pass_context to get info_ctx ClickRuntime->>info_func: info(ctx=info_ctx, verbose=True) info_func->>info_ctx: Accesses ctx.command.name info_func->>info_ctx: Accesses ctx.params['verbose'] (or local 'verbose') info_func->>info_ctx: Modifies ctx.obj['verbose'] = True info_func->>info_ctx: Accesses ctx.obj['verbose'] (now True) info_func-->>ClickRuntime: Returns ClickRuntime->>info_ctx: Tears down info_ctx (runs call_on_close) ClickRuntime->>cli_ctx: Tears down cli_ctx (runs call_on_close) ClickRuntime-->>CLI: Exits ``` The core `Context` class is defined in `click/core.py`. The decorators `pass_context` and `pass_obj` are in `click/decorators.py`, and the mechanism for tracking the current context is in `click/globals.py`. 
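
As a small illustration of those last two pieces, the sketch below uses `@click.pass_obj` to receive only `ctx.obj`, and `click.get_current_context()` to look up the active context from a helper that takes no `ctx` argument. The `whoami` command, the helper, and the `user` key are hypothetical names chosen for the example.

```python
import click
from click.testing import CliRunner


def current_command_name():
    """Helper without a ctx parameter: asks Click for the active context."""
    return click.get_current_context().command.name


@click.group()
@click.pass_context
def cli(ctx):
    ctx.obj = {"user": "alice"}  # hypothetical shared state


@cli.command()
@click.pass_obj  # injects ctx.obj instead of the whole context
def whoami(obj):
    click.echo(f"{current_command_name()}: {obj['user']}")


result = CliRunner().invoke(cli, ["whoami"])
print(result.output)  # whoami: alice

# Outside a running command there is no active context.
print(click.get_current_context(silent=True))  # None
```

Without `silent=True`, calling `get_current_context()` outside a command raises `RuntimeError`, so helpers that may run in both situations should pass it.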
## Conclusion The `Context` (`ctx`) is a cornerstone concept in Click, acting as the runtime carrier of information for a command invocation. You've learned: * The Context holds data like the current command, parameters, parent context, and shared application objects (`ctx.obj`). * The `@click.pass_context` decorator injects the current Context into your command function. * `ctx.obj` is essential for sharing state between nested commands. * `ctx.invoke()` and `ctx.forward()` allow commands to call each other programmatically. * Click uses the context for resource management (`ctx.call_on_close`), ensuring cleanup. Understanding the Context is key to building more complex Click applications where commands need to interact with each other or with shared application state. It provides the structure and communication channels necessary for sophisticated CLI tools. So far, we've focused on the logic and structure of commands. But how can we make the interaction in the terminal itself more engaging? How do we prompt users for input, show progress bars, or display colored output? Let's explore Click's terminal UI capabilities next! Next up: [Chapter 6: Term UI (Terminal User Interface)](06_term_ui__terminal_user_interface_.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Click/06_term_ui__terminal_user_interface_.md ================================================ --- layout: default title: "Term UI (Terminal User Interface)" parent: "Click" nav_order: 6 --- # Chapter 6: Term UI (Terminal User Interface) Welcome back! In [Chapter 5: Context](05_context.md), we learned how Click uses the `Context` object (`ctx`) to manage the state of a command while it's running, allowing us to share information and call other commands. So far, our commands have mostly just printed simple text. 
But what if we want to make our command-line tools more interactive and user-friendly? How can we: * Ask the user for input (like their name or a filename)? * Ask simple yes/no questions? * Show a progress bar for long-running tasks? * Make our output more visually appealing with colors or styles (like making errors red)? This is where Click's **Terminal User Interface (Term UI)** functions come in handy. They are Click's toolkit for talking *back and forth* with the user through the terminal. ## Making Our Tools Talk: The Need for Term UI Imagine you're building a tool that processes a large data file. A purely silent tool isn't very helpful. A better tool might: 1. Ask the user which file to process. 2. Ask for confirmation before starting a potentially long operation. 3. Show a progress bar while processing the data. 4. Print a nice, colored "Success!" message at the end, or a red "Error!" message if something went wrong. Doing all this reliably across different operating systems (like Linux, macOS, and Windows) can be tricky. For example, getting colored text to work correctly on Windows requires special handling. Click's Term UI functions wrap up these common interactive tasks into easy-to-use functions that work consistently everywhere. Let's explore some of the most useful ones! ## Printing with `click.echo()` We've seen `print()` in Python, but Click provides its own version: `click.echo()`. Why use it? * **Smarter:** It works better with different kinds of data (like Unicode text and raw bytes). * **Cross-Platform:** It handles subtle differences between operating systems for you. * **Color Aware:** It automatically strips out color codes if the output isn't going to a terminal (like if you redirect output to a file), preventing garbled text. * **Integrated:** It works seamlessly with Click's other features, like redirecting output or testing. 
Using it is just like `print()`: ```python # echo_example.py import click @click.command() def cli(): """Demonstrates click.echo""" click.echo("Hello from Click!") # You can print errors to stderr easily click.echo("Oops, something went wrong!", err=True) if __name__ == '__main__': cli() ``` Running this: ```bash $ python echo_example.py Hello from Click! Oops, something went wrong! # (This line goes to stderr) ``` Simple! For most printing in Click apps, `click.echo()` is preferred over `print()`. ## Adding Style: `click.style()` and `click.secho()` Want to make your output stand out? Click makes it easy to add colors and styles (like bold or underline) to your text. * `click.style(text, fg='color', bg='color', bold=True, ...)`: Takes your text and wraps it with special codes that terminals understand to change its appearance. It returns the modified string. * `click.secho(text, fg='color', ...)`: A shortcut that combines `style` and `echo`. It styles the text *and* prints it in one go. Let's make our success and error messages more obvious: ```python # style_example.py import click @click.command() def cli(): """Demonstrates styled output""" # Style the text first, then echo it success_message = click.style("Operation successful!", fg='green', bold=True) click.echo(success_message) # Or use secho for style + echo in one step click.secho("Critical error!", fg='red', underline=True, err=True) if __name__ == '__main__': cli() ``` Running this (your terminal must support color): ```bash $ python style_example.py # Output will look something like: # Operation successful! (in bold green) # Critical error! (in underlined red, sent to stderr) ``` Click supports various colors (`'red'`, `'green'`, `'blue'`, etc.) and styles (`bold`, `underline`, `blink`, `reverse`). This makes your CLI output much more informative at a glance! ## Getting User Input: `click.prompt()` Sometimes you need to ask the user for information. `click.prompt()` is designed for this. 
It shows a message and waits for the user to type something and press Enter. ```python # prompt_example.py import click @click.command() def cli(): """Asks for user input""" name = click.prompt("Please enter your name") click.echo(f"Hello, {name}!") # You can specify a default value location = click.prompt("Enter location", default="Earth") click.echo(f"Location: {location}") # You can also require a specific type (like an integer) age = click.prompt("Enter your age", type=int) click.echo(f"You are {age} years old.") if __name__ == '__main__': cli() ``` Running this interactively: ```bash $ python prompt_example.py Please enter your name: Alice Hello, Alice! Enter location [Earth]: # Just press Enter here Location: Earth Enter your age: 30 You are 30 years old. ``` If you enter something that can't be converted to the `type` (like "abc" for age), `click.prompt` will automatically show an error and ask again! It can also hide input for passwords (`hide_input=True`). ## Asking Yes/No: `click.confirm()` A common need is asking for confirmation before doing something potentially destructive or time-consuming. `click.confirm()` handles this nicely. ```python # confirm_example.py import click import time @click.command() @click.option('--yes', is_flag=True, help='Assume Yes to confirmation.') def cli(yes): """Asks for confirmation.""" click.echo("This might take a while or change things.") # If --yes flag is given, `yes` is True, otherwise ask. # abort=True means if user says No, stop the program. if not yes: click.confirm("Do you want to continue?", abort=True) click.echo("Starting operation...") time.sleep(2) # Simulate work click.echo("Done!") if __name__ == '__main__': cli() ``` Running interactively: ```bash $ python confirm_example.py This might take a while or change things. Do you want to continue? [y/N]: y # User types 'y' Starting operation... Done! 
``` If the user types 'n' (or just presses Enter, since the default is No - indicated by `[y/N]`), the program will stop immediately because of `abort=True`. If you run `python confirm_example.py --yes`, it skips the question entirely. ## Showing Progress: `click.progressbar()` For tasks that take a while, it's good practice to show the user that something is happening. `click.progressbar()` creates a visual progress bar. You typically use it with a Python `with` statement around a loop. Let's simulate processing a list of items: ```python # progress_example.py import click import time items_to_process = range(100) # Simulate 100 items @click.command() def cli(): """Shows a progress bar.""" # 'items_to_process' is the iterable # 'label' is the text shown before the bar with click.progressbar(items_to_process, label="Processing items") as bar: for item in bar: # Simulate work for each item time.sleep(0.05) # The 'bar' automatically updates with each iteration click.echo("Finished processing!") if __name__ == '__main__': cli() ``` When you run this, you'll see a progress bar update in your terminal: ```bash $ python progress_example.py Processing items [####################################] 100% 00:00:05 Finished processing! # (The bar animates in place while running) ``` The progress bar automatically figures out the percentage and estimated time remaining (ETA). It makes long tasks much less mysterious for the user. You can also use it without an iterable by manually calling the `bar.update(increment)` method inside the `with` block. ## How Term UI Works Under the Hood These functions seem simple, but they handle quite a bit behind the scenes: 1. **Abstraction:** They provide a high-level API for common terminal tasks, hiding the low-level details. 2. **Input Handling:** Functions like `prompt` and `confirm` use Python's built-in `input()` or `getpass.getpass()` (for hidden input). 
They add loops for retries, default value handling, and type conversion/validation (using [ParamType](04_paramtype.md) concepts internally). 3. **Output Handling (`echo`, `secho`):** * They check if the output stream (`stdout` or `stderr`) is connected to a terminal (`isatty`). * If not a terminal, or if color is disabled, `style` codes are automatically removed (`strip_ansi`). * On Windows, if `colorama` is installed, Click wraps the output streams to translate ANSI color codes into Windows API calls, making colors work automatically. 4. **Progress Bar (`progressbar`):** * It calculates the percentage complete based on the iterable's length (or the provided `length`). * It estimates the remaining time (ETA) by timing recent iterations. * It formats the bar (`#` and `-` characters) and info text. * Crucially, it uses special terminal control characters (like `\r` - carriage return) to move the cursor back to the beginning of the line before printing the updated bar. This makes the bar *appear* to update in place rather than printing many lines. It also hides/shows the cursor during updates (`\033[?25l`, `\033[?25h`) on non-Windows systems for a smoother look. 5. **Cross-Platform Compatibility:** A major goal is to make these interactions work consistently across different operating systems and terminal types, handling quirks like Windows console limitations (`_winconsole.py`, `_compat.py`). Let's visualize what might happen when you call `click.secho("Error!", fg='red', err=True)`: ```mermaid sequenceDiagram participant UserCode as Your Code participant ClickSecho as click.secho() participant ClickStyle as click.style() participant ClickEcho as click.echo() participant CompatLayer as Click Compatibility Layer participant Terminal UserCode->>ClickSecho: secho("Error!", fg='red', err=True) ClickSecho->>ClickStyle: style("Error!", fg='red', ...) 
ClickStyle-->>ClickSecho: Returns "\033[31mError!\033[0m" (styled text) ClickSecho->>ClickEcho: echo("\033[31mError!\033[0m", err=True) ClickEcho->>CompatLayer: Check if output (stderr) is a TTY CompatLayer-->>ClickEcho: Yes, it's a TTY ClickEcho->>CompatLayer: Check if color is enabled CompatLayer-->>ClickEcho: Yes, color is enabled Note over ClickEcho, Terminal: On Windows, may wrap stream with Colorama here ClickEcho->>CompatLayer: Write styled text to stderr CompatLayer->>Terminal: Writes "\033[31mError!\033[0m\n" Terminal-->>Terminal: Displays "Error!" in red ``` The key is that Click adds layers of checks and formatting (`style`, color stripping, platform adaptation) around the basic act of printing (`echo`) or getting input (`prompt`). You can find the implementation details in: * `click/termui.py`: Defines the main functions like `prompt`, `confirm`, `style`, `secho`, `progressbar`, `echo_via_pager`. * `click/_termui_impl.py`: Contains the implementations for more complex features like `ProgressBar`, `Editor`, `pager`, and `getchar`. * `click/utils.py`: Contains `echo` and helpers like `open_stream`. * `click/_compat.py` & `click/_winconsole.py`: Handle differences between Python versions and operating systems, especially for terminal I/O and color support on Windows. ## Conclusion Click's **Term UI** functions are essential for creating command-line applications that are interactive, informative, and pleasant to use. You've learned how to: * Print output reliably with `click.echo`. * Add visual flair with colors and styles using `click.style` and `click.secho`. * Ask the user for input with `click.prompt`. * Get yes/no confirmation using `click.confirm`. * Show progress for long tasks with `click.progressbar`. These tools handle many cross-platform complexities, letting you focus on building the core logic of your interactive CLI. But what happens when things go wrong? How does Click handle errors, like invalid user input or missing files? 
That's where Click's exception handling comes in. Let's dive into that next! Next up: [Chapter 7: Click Exceptions](07_click_exceptions.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Click/07_click_exceptions.md ================================================ --- layout: default title: "Click Exceptions" parent: "Click" nav_order: 7 --- # Chapter 7: Click Exceptions - Handling Errors Gracefully In the last chapter, [Chapter 6: Term UI (Terminal User Interface)](06_term_ui__terminal_user_interface_.md), we explored how to make our command-line tools interactive and visually appealing using functions like `click.prompt`, `click.confirm`, and `click.secho`. We learned how to communicate effectively *with* the user. But what happens when the user doesn't communicate effectively with *us*? What if they type the wrong command, forget a required argument, or enter text when a number was expected? Our programs need a way to handle these errors without just crashing. This is where **Click Exceptions** come in. They are Click's way of signaling that something went wrong, usually because of a problem with the user's input or how they tried to run the command. ## Why Special Exceptions? The Problem with Crashes Imagine you have a command that needs a number, like `--count 5`. You used `type=click.INT` like we learned in [Chapter 4: ParamType](04_paramtype.md). What happens if the user types `--count five`? If Click didn't handle this specially, the `int("five")` conversion inside Click would fail, raising a standard Python `ValueError`. This might cause your program to stop with a long, confusing Python traceback message that isn't very helpful for the end-user. They might not understand what went wrong or how to fix it. Click wants to provide a better experience. 
When something like this happens, Click catches the internal error and raises one of its own **custom exception types**. These special exceptions tell Click exactly what kind of problem occurred (e.g., bad input, missing argument).

## Meet the Click Exceptions

Click has a family of exception classes designed specifically for handling command-line errors. Most of them inherit from the base class `click.ClickException`.

Here are some common ones you'll encounter (or use):

* `ClickException`: The base for all Click-handled errors.
* `UsageError`: A general error indicating the command was used incorrectly (e.g., wrong number of arguments). It usually prints the command's usage instructions.
* `BadParameter`: Raised when the value provided for an option or argument is invalid (e.g., "five" for an integer type, or a value not in a `click.Choice`).
* `MissingParameter`: Raised when a required option or argument is not provided.
* `NoSuchOption`: Raised when the user tries to use an option that doesn't exist (e.g., `--verrbose` instead of `--verbose`).
* `FileError`: Raised when a file can't be opened, for example when a lazily opened `click.File` fails on first access. Failures detected while `click.File` or `click.Path` converts the value are usually reported as `BadParameter` instead.
* `Abort`: A special exception you can raise to stop execution immediately (like after a declined `click.confirm(..., abort=True)`). Unlike the others, it is not a `ClickException` subclass; Click handles it separately by printing `Aborted!` and exiting with code 1.

**The Magic:** The really neat part is that Click's main command processing logic is designed to *catch* these specific exceptions. When it catches a `ClickException`, it doesn't just crash. Instead, it:

1. **Formats a helpful error message:** Often using information from the exception itself (like which parameter was bad).
2. **Prints the message** (usually prefixed with "Error:") to the standard error stream (`stderr`).
3. **Often shows relevant help text** (like the command's usage synopsis).
4. **Exits the application cleanly** with a non-zero exit code (signaling to the system that an error occurred).

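
The four steps above can be observed without opening a shell. The sketch below uses `click.testing.CliRunner`, Click's test helper, to invoke a small command with good and bad input. The `repeat` command here is a hypothetical stand-in written for this illustration, not the code from earlier chapters.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--count", default=1, type=click.INT, help="Number of times to print.")
def repeat(count):
    """Hypothetical command used only for this demonstration."""
    click.echo("hi " * count)


runner = CliRunner()

# Valid input: the command body runs and the exit code is 0.
ok = runner.invoke(repeat, ["--count", "2"])

# Invalid input: Click raises BadParameter internally, prints the error,
# and exits with a usage-error code instead of showing a traceback.
bad = runner.invoke(repeat, ["--count", "five"])

print("valid run exit code:", ok.exit_code)
print("invalid run exit code:", bad.exit_code)
print(bad.output)
```

With Click 8 the second run should report exit code `2`, and its output should contain the usage line followed by an `Error: Invalid value for '--count'` message. The exact wording may differ between Click versions.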
This gives the user clear feedback about what they did wrong and how to potentially fix it, without seeing scary Python tracebacks. ## Seeing Exceptions in Action (Automatically) You've already seen Click exceptions working! Remember our `count_app.py` from [Chapter 4: ParamType](04_paramtype.md)? ```python # count_app.py (from Chapter 4) import click @click.command() @click.option('--count', default=1, type=click.INT, help='Number of times to print.') @click.argument('message') def repeat(count, message): """Prints MESSAGE the specified number of times.""" for _ in range(count): click.echo(message) if __name__ == '__main__': repeat() ``` If you run this with invalid input for `--count`: ```bash $ python count_app.py --count five "Oh no" Usage: count_app.py [OPTIONS] MESSAGE Try 'count_app.py --help' for help. Error: Invalid value for '--count': 'five' is not a valid integer. ``` That clear "Error: Invalid value for '--count': 'five' is not a valid integer." message? That's Click catching a `BadParameter` exception (raised internally by `click.INT.convert`) and showing it nicely! What if you forget the required `MESSAGE` argument? ```bash $ python count_app.py --count 3 Usage: count_app.py [OPTIONS] MESSAGE Try 'count_app.py --help' for help. Error: Missing argument 'MESSAGE'. ``` Again, a clear error message! This time, Click caught a `MissingParameter` exception. ## Raising Exceptions Yourself: Custom Validation Click raises exceptions automatically for many common errors. But sometimes, you have validation logic that's specific to your application. For example, maybe an `--age` option must be positive. The standard way to report these custom validation errors is to **raise a `click.BadParameter` exception** yourself, usually from within a callback function. Let's add a callback to our `count_app.py` to ensure `count` is positive. ```python # count_app_validate.py import click # 1. 
Define a validation callback function def validate_count(ctx, param, value): """Callback to ensure count is positive.""" if value <= 0: # 2. Raise BadParameter if validation fails raise click.BadParameter("Count must be a positive number.") # 3. Return the value if it's valid return value @click.command() # 4. Attach the callback to the --count option @click.option('--count', default=1, type=click.INT, help='Number of times to print.', callback=validate_count) # <-- Added callback @click.argument('message') def repeat(count, message): """Prints MESSAGE the specified number of times (must be positive).""" for _ in range(count): click.echo(message) if __name__ == '__main__': repeat() ``` Let's break down the changes: 1. `def validate_count(ctx, param, value):`: We defined a function that takes the [Context](05_context.md), the [Parameter](03_parameter__option___argument_.md) object, and the *already type-converted* value. 2. `raise click.BadParameter(...)`: If the `value` (which we know is an `int` thanks to `type=click.INT`) is not positive, we raise `click.BadParameter` with our custom error message. 3. `return value`: If the value is valid, the callback **must** return it. 4. `callback=validate_count`: We told the `--count` option to use our `validate_count` function after type conversion. **Run it with invalid input:** ```bash $ python count_app_validate.py --count 0 "Zero?" Usage: count_app_validate.py [OPTIONS] MESSAGE Try 'count_app_validate.py --help' for help. Error: Invalid value for '--count': Count must be a positive number. $ python count_app_validate.py --count -5 "Negative?" Usage: count_app_validate.py [OPTIONS] MESSAGE Try 'count_app_validate.py --help' for help. Error: Invalid value for '--count': Count must be a positive number. ``` It works! Our custom validation logic triggered, we raised `click.BadParameter`, and Click caught it, displaying our specific error message cleanly. 
This is the standard way to integrate your own validation rules into Click's error handling. ## How Click Handles Exceptions (Under the Hood) What exactly happens when a Click exception is raised, either by Click itself or by your code? 1. **Raise:** An operation fails (like type conversion, parsing finding a missing argument, or your custom callback). A specific `ClickException` subclass (e.g., `BadParameter`, `MissingParameter`) is instantiated and raised. 2. **Catch:** Click's main application runner (usually triggered when you call your top-level `cli()` function) has a `try...except ClickException` block around the command execution logic. 3. **Show:** When a `ClickException` is caught, the runner calls the exception object's `show()` method. 4. **Format & Print:** The `show()` method (defined in `exceptions.py` for each exception type) formats the error message. * `UsageError` (and its subclasses like `BadParameter`, `MissingParameter`, `NoSuchOption`) typically includes the command's usage string (`ctx.get_usage()`) and a hint to try the `--help` option. * `BadParameter` adds context like "Invalid value for 'PARAMETER_NAME':". * `MissingParameter` formats "Missing argument/option 'PARAMETER_NAME'.". * The formatted message is printed to `stderr` using `click.echo()`, respecting color settings from the context. 5. **Exit:** After showing the message, Click calls `sys.exit()` with the exception's `exit_code` (usually `1` for general errors, `2` for usage errors). This terminates the program and signals the error status to the calling shell or script. 
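
As a rough sketch of steps 2 to 5, the code below plays the role of the runner by hand. It relies on `standalone_mode=False`, which makes `main()` re-raise the exception instead of handling it. The helper `run_and_report` and the `repeat` command are hypothetical names used only for illustration.

```python
import io

import click


@click.command()
@click.option("--count", default=1, type=click.INT)
def repeat(count):
    """Hypothetical command used only for this demonstration."""
    click.echo(count)


def run_and_report(args):
    """Mimic the runner: catch the ClickException, show it, return its exit code."""
    captured = io.StringIO()
    try:
        # standalone_mode=False tells Click to re-raise instead of calling sys.exit()
        repeat.main(args=args, prog_name="repeat", standalone_mode=False)
    except click.ClickException as exc:
        exc.show(file=captured)  # steps 3-4: format the message and print it
        # step 5: in standalone mode this code would be passed to sys.exit()
        return exc.exit_code, captured.getvalue()
    return 0, captured.getvalue()


code, text = run_and_report(["--count", "five"])
print(code)
print(text)
```

In normal (standalone) mode Click performs the same `show()` call for you and then passes `exit_code` to `sys.exit()`.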
Here's a simplified sequence diagram for the `BadParameter` case when a user provides invalid input that fails type conversion:

```mermaid
sequenceDiagram
    participant User
    participant CLI as YourApp.py
    participant ClickRuntime
    participant ParamType as ParamType (e.g., click.INT)
    participant ClickExceptionHandling

    User->>CLI: python YourApp.py --count five
    CLI->>ClickRuntime: Starts command execution
    ClickRuntime->>ParamType: Calls convert(value='five', ...) for '--count'
    ParamType->>ParamType: Tries int('five'), raises ValueError
    ParamType->>ClickExceptionHandling: Catches ValueError, calls self.fail(...)
    ClickExceptionHandling->>ClickExceptionHandling: Raises BadParameter("...'five' is not...")
    ClickExceptionHandling-->>ClickRuntime: BadParameter propagates up
    ClickRuntime->>ClickExceptionHandling: Catches BadParameter exception
    ClickExceptionHandling->>ClickExceptionHandling: Calls exception.show()
    ClickExceptionHandling->>CLI: Prints formatted "Error: Invalid value..." to stderr
    ClickExceptionHandling->>CLI: Calls sys.exit(exception.exit_code)
    CLI-->>User: Shows error message and exits
```

The core exception classes are defined in `click/exceptions.py`. You can see how `ClickException` defines the basic `show` method and `exit_code`, how `UsageError` overrides `show` to add the usage line and help hint, and how `BadParameter` overrides `format_message` to name the offending parameter, based on the context (`ctx`) and parameter (`param`) they might hold.

```python
# Simplified structure from click/exceptions.py

class ClickException(Exception):
    exit_code = 1

    def __init__(self, message: str) -> None:
        # ... (stores message, gets color settings) ...
        self.message = message

    def format_message(self) -> str:
        return self.message

    def show(self, file=None) -> None:
        # ... (gets stderr if file is None) ...
        echo(f"Error: {self.format_message()}", file=file, color=self.show_color)


class UsageError(ClickException):
    exit_code = 2

    def __init__(self, message: str, ctx=None) -> None:
        super().__init__(message)
        self.ctx = ctx
        # ...

    def show(self, file=None) -> None:
        # ... (gets stderr, color) ...
        hint = ""
        if self.ctx is not None and self.ctx.command.get_help_option(self.ctx):
            hint = f"Try '{self.ctx.command_path} {self.ctx.help_option_names[0]}' for help.\n"
        if self.ctx is not None:
            echo(f"{self.ctx.get_usage()}\n{hint}", file=file, color=color)
        # Then print the "Error: ..." line, as the base class does
        echo(f"Error: {self.format_message()}", file=file, color=color)


class BadParameter(UsageError):
    def __init__(self, message: str, ctx=None, param=None, param_hint=None) -> None:
        super().__init__(message, ctx)
        self.param = param
        self.param_hint = param_hint

    def format_message(self) -> str:
        # ... (logic to get parameter name/hint) ...
        param_hint = self.param.get_error_hint(self.ctx) if self.param else self.param_hint
        # ...
        return f"Invalid value for {param_hint}: {self.message}"

# Other exceptions like MissingParameter, NoSuchOption follow similar patterns
```

By using this structured exception system, Click ensures that user errors are reported consistently and helpfully across any Click application.

## Conclusion

Click Exceptions are the standard mechanism for reporting errors related to command usage and user input within Click applications.

You've learned:

* Click uses custom exceptions like `UsageError`, `BadParameter`, and `MissingParameter` to signal specific problems.
* Click catches these exceptions automatically to display user-friendly error messages, usage hints, and exit cleanly.
* You can (and should) raise exceptions like `click.BadParameter` in your own validation callbacks to report custom errors in a standard way.
* This system prevents confusing Python tracebacks and provides helpful feedback to the user.

Understanding and using Click's exception hierarchy is key to building robust and user-friendly command-line interfaces that handle problems gracefully. This concludes our journey through the core concepts of Click! We've covered everything from basic [Commands and Groups](01_command___group.md), [Decorators](02_decorators.md), [Parameters](03_parameter__option___argument_.md), and [Types](04_paramtype.md), to managing runtime state with the [Context](05_context.md), creating interactive [Terminal UIs](06_term_ui__terminal_user_interface_.md), and handling errors with [Click Exceptions](07_click_exceptions.md). Armed with this knowledge, you're well-equipped to start building your own powerful and elegant command-line tools with Click! --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Click/index.md ================================================ --- layout: default title: "Click" nav_order: 6 has_children: true --- # Tutorial: Click > This tutorial is AI-generated! To learn more, check out [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) Click[View Repo](https://github.com/pallets/click/tree/main/src/click) is a Python library that makes creating **command-line interfaces (CLIs)** *easy and fun*. It uses simple Python **decorators** (`@click.command`, `@click.option`, etc.) to turn your functions into CLI commands with options and arguments. Click handles parsing user input, generating help messages, validating data types, and managing the flow between commands, letting you focus on your application's logic. It also provides tools for *terminal interactions* like prompting users and showing progress bars. 
```mermaid flowchart TD A0["Context"] A1["Command / Group"] A2["Parameter (Option / Argument)"] A3["ParamType"] A4["Decorators"] A5["Term UI (Terminal User Interface)"] A6["Click Exceptions"] A4 -- "Creates/Configures" --> A1 A4 -- "Creates/Configures" --> A2 A0 -- "Manages execution of" --> A1 A0 -- "Holds parsed values for" --> A2 A2 -- "Uses for validation/conversion" --> A3 A3 -- "Raises on conversion error" --> A6 A1 -- "Uses for user interaction" --> A5 A0 -- "Handles/Raises" --> A6 A4 -- "Injects via @pass_context" --> A0 ``` ================================================ FILE: docs/Codex/01_terminal_ui__ink_components_.md ================================================ --- layout: default title: "Terminal UI (Ink Components)" parent: "Codex" nav_order: 1 --- # Chapter 1: Terminal UI (Ink Components) Welcome to the Codex tutorial! We're excited to have you explore how Codex works under the hood. This first chapter dives into how Codex creates its chat interface right inside your terminal window. ## What's the Big Idea? Imagine you want `Codex` to write a simple script. You type something like `codex "write a python script that prints hello world"` into your terminal. How does Codex show you the conversation – your request, its response, maybe questions it asks, or commands it suggests running – all without opening a separate window? And how do you type your next message? That's where the **Terminal UI** comes in. It's the system responsible for drawing the entire chat interface you see and interact with directly in your command line. Think of it like the dashboard and controls of a car: * **Dashboard:** Displays information (like the chat history, AI messages, loading indicators). * **Controls (Steering Wheel, Pedals):** Let you interact (like the input field where you type messages, or menus to approve commands). 
Just like the car's dashboard lets you see what the engine is doing and control it, the Terminal UI lets you see what the core `Codex` logic (the [Agent Loop](03_agent_loop.md)) is doing and provide input to it.

## Key Concepts: Ink & React

How does Codex build this terminal interface? It uses two main technologies:

1. **Ink:** This is a fantastic library that lets developers build command-line interfaces using **React**. If you know React for web development, Ink feels very similar, but instead of rendering buttons and divs in a browser, it renders text, boxes, and lists in your terminal.
2. **React Components:** The UI is broken down into reusable pieces called React components. We have components for:
    * Displaying individual messages (`TerminalChatResponseItem`).
    * Showing the whole conversation history (`MessageHistory`).
    * The text box where you type your input (`TerminalChatInput` / `TerminalChatNewInput`).
    * Prompts asking you to approve commands (`TerminalChatCommandReview`).
    * Spinners to show when Codex is thinking.

These components work together, managed by React, to create the dynamic interface you see.

## How You See It: Rendering the Chat

When you run `Codex`, the main application component (`App` in `app.tsx`) kicks things off. It might first check if you're in a safe directory (like a Git repository) and ask for confirmation if not.

```tsx
// File: codex-cli/src/app.tsx (Simplified)
// ... imports ...
import TerminalChat from "./components/chat/terminal-chat";
import { ConfirmInput } from "@inkjs/ui";
import { Box, Text, useApp } from "ink";
import React, { useState } from "react";

export default function App({ /* ...props... */ }): JSX.Element {
  const app = useApp();
  const [accepted, setAccepted] = useState(/* ... */);
  const inGitRepo = /* ... check if in git ... */;

  // If not in a git repo and not yet accepted, show a warning
  if (!inGitRepo && !accepted) {
    return (
      <Box /* ... */>
        <Text /* ... */>Warning! Not in a git repo.</Text>
        <ConfirmInput
          onConfirm={() => setAccepted(true)}
          onCancel={() => app.exit()}
        />
      </Box>
    );
  }

  // Otherwise, render the main chat interface
  return <TerminalChat /* ...props... */ />;
}
```

This snippet shows how the `App` component uses Ink's `<Box>`, `<Text>`, and even interactive components like `<ConfirmInput>`. If the safety check passes, it renders the core `<TerminalChat>` component.

The `<TerminalChat>` component (`terminal-chat.tsx`) is the main hub for the chat UI. It manages the state, like the list of messages (`items`), whether the AI is currently working (`loading`), and any command confirmations needed (`confirmationPrompt`).

```tsx
// File: codex-cli/src/components/chat/terminal-chat.tsx (Simplified)
// ... imports ...
import TerminalMessageHistory from "./terminal-message-history";
import TerminalChatInput from "./terminal-chat-input"; // Or TerminalChatNewInput
import { Box } from "ink";
import React, { useState } from "react";

export default function TerminalChat({ /* ...props... */ }): React.ReactElement {
  const [items, setItems] = useState<Array<ResponseItem>>([]); // Holds all messages
  const [loading, setLoading] = useState(false); // Is the AI busy?
  const [confirmationPrompt, setConfirmationPrompt] =
    useState<React.ReactNode | null>(null); // Command to review?
  // ... other state and logic ...

  return (
    <Box /* ... */>
      {/* Display the conversation history */}
      <TerminalMessageHistory /* ...props, including the items... */ />

      {/* Display the input box or the command review prompt */}
      <TerminalChatInput
        /* ...props such as loading and confirmationPrompt... */
        submitInput={(/*...input...*/) => { /* Send to Agent Loop */ }}
        submitConfirmation={(/*...decision...*/) => { /* Send to Agent Loop */ }}
        /* ...other props... */
      />
    </Box>
  );
}
```

* `<TerminalMessageHistory>` takes the list of `items` (messages) and renders them.
* `<TerminalChatInput>` (or its multiline sibling `<TerminalChatNewInput>`) displays the input box when `loading` is false and there's no `confirmationPrompt`. If there *is* a `confirmationPrompt`, it shows the command review UI instead.

### Showing Messages

How does `<TerminalMessageHistory>` actually display the messages? It uses a special Ink component called `<Static>` for efficiency and maps each message `item` to a `<TerminalChatResponseItem>`.

```tsx
// File: codex-cli/src/components/chat/terminal-message-history.tsx (Simplified)
// ... imports ...
import TerminalChatResponseItem from "./terminal-chat-response-item";
import { Box, Static } from "ink";
import React from "react";

const MessageHistory: React.FC = ({ batch, /* ... */ }) => {
  // Extract the actual message objects
  const messages = batch.map(({ item }) => item!);

  return (
    <Box /* ... */>
      {/* <Static> renders past items efficiently */}
      <Static items={messages}>
        {(message, index) => (
          // Render each message using TerminalChatResponseItem
          <TerminalChatResponseItem key={index} item={message} />
        )}
      </Static>
    </Box>
  );
};

export default React.memo(MessageHistory);
```

`<Static>` tells Ink that these items won't change often, allowing Ink to optimize rendering. Each message is passed to `<TerminalChatResponseItem>`.

Inside `TerminalChatResponseItem` (`terminal-chat-response-item.tsx`), we figure out what *kind* of message it is (user message, AI response, command output, etc.) and render it accordingly using Ink's basic `<Box>` and `<Text>` components, sometimes with helpers like `<Markdown>` for formatting.

```tsx
// File: codex-cli/src/components/chat/terminal-chat-response-item.tsx (Simplified)
// ... imports ...
import { Box, Text } from "ink";
import React from "react";
// ... other components like Markdown ...

export default function TerminalChatResponseItem({ item }: { item: ResponseItem }): React.ReactElement {
  switch (item.type) {
    case "message": // User or AI text message
      return (
        <Box /* ... */>
          <Text /* ... */>
            {item.role === "assistant" ? "codex" : item.role}
          </Text>
          {/* Render message content, potentially using Markdown */}
          <Markdown>{/* ... content ... */}</Markdown>
        </Box>
      );
    case "function_call": // AI wants to run a command
      return (
        <Box /* ... */>
          <Text /* ... */>command</Text>
          <Text>$ {/* Formatted command */}</Text>
        </Box>
      );
    // ... other cases like function_call_output ...
    default:
      return <Text>Unknown message type</Text>;
  }
}
```

### Getting Your Input

The `<TerminalChatInput>` (or `<TerminalChatNewInput>`) component uses specialized input components (like `<TextInput>` from `ink-text-input` or our custom `<MultilineTextEditor>`) to capture your keystrokes. When you press Enter, it calls the `onSubmit` or `submitInput` function provided by `<TerminalChat>`.

```tsx
// File: codex-cli/src/components/chat/terminal-chat-new-input.tsx (Simplified)
// ... imports ...
import MultilineTextEditor from "./multiline-editor"; // Custom multiline input
import { Box, Text, useInput } from "ink";
import React, { useState } from "react";

export default function TerminalChatInput({ submitInput, active, /* ... */ }): React.ReactElement {
  const [input, setInput] = useState(""); // Current text in the editor
  const editorRef = React.useRef(/* ... */); // Handle to editor

  // useInput hook from Ink handles key presses (like Up/Down for history)
  useInput((_input, _key) => {
    // Handle history navigation (Up/Down arrows)
    // ... logic using editorRef.current.getRow() ...
  }, { isActive: active });

  return (
    <Box /* ... */>
      {/* The actual input field */}
      <MultilineTextEditor
        /* ...other props, such as the editor ref... */
        onChange={(txt) => setInput(txt)}
        initialText={input}
        focus={active} // Only active when overlay isn't shown
        onSubmit={(text) => {
          // When Enter is pressed (and not escaped)
          submitInput(/* ...create input item from text... */);
          setInput(""); // Clear the input field
        }}
      />
      {/* Help text */}
      <Text /* ... */>ctrl+c to exit | enter to send</Text>
    </Box>
  );
}
```

This component manages the text you type and uses Ink's `useInput` hook to handle special keys like arrow keys for command history. The details of text editing are handled in the next chapter: [Input Handling (TextBuffer/Editor)](02_input_handling__textbuffer_editor_.md).

### Reviewing Commands

If the [Agent Loop](03_agent_loop.md) decides it needs to run a command and requires your approval, `<TerminalChat>` will receive a `confirmationPrompt`. This prompt (which is itself a React element that displays the command) is passed down to `<TerminalChatInput>`, which then renders `<TerminalChatCommandReview>` instead of the regular input box.

```tsx
// File: codex-cli/src/components/chat/terminal-chat-command-review.tsx (Simplified)
// ... imports ...
// @ts-expect-error - Using a vendor component for selection
import { Select } from "../vendor/ink-select/select";
import TextInput from "../vendor/ink-text-input"; // For editing feedback
import { Box, Text, useInput } from "ink";
import React from "react";

export function TerminalChatCommandReview({
  confirmationPrompt, // The command display element
  onReviewCommand, // Function to call with the decision
}: { /* ... */ }): React.ReactElement {
  const [mode, setMode] = React.useState<"select" | "input">("select"); // Select Yes/No or type feedback

  // Options for the selection list
  const approvalOptions = [
    { label: "Yes (y)", value: ReviewDecision.YES },
    // ... other options like Always, Edit, No ...
  ];

  useInput((input, key) => { /* Handle shortcuts like 'y', 'n', 'e', Esc */ });

  return (
    <Box /* ... */>
      {/* Display the command that needs review */}
      {confirmationPrompt}

      {mode === "select" ? (
        <>
          <Text>Allow command?</Text>
          <Select options={approvalOptions} /* ...change handler... */ />
        </>
      ) : (
        <TextInput /* ...props for typing feedback... */ />
      )}
    </Box>
  );
}
```

This component shows the command passed in as `confirmationPrompt`, presents the choices using a `<Select>` component (or a `<TextInput>` if you choose to edit/give feedback), listens for your choice (via keyboard shortcuts or the selection list), and finally calls `onReviewCommand` with your decision.

## Under the Hood: How It All Connects

Let's trace the flow from starting Codex to seeing an AI response:

```mermaid
sequenceDiagram
    participant User
    participant Terminal
    participant CodexCLI
    participant InkReactApp as Ink/React UI
    participant AgentLoop as Agent Loop

    User->>Terminal: Runs `codex "prompt"`
    Terminal->>CodexCLI: Starts the process
    CodexCLI->>InkReactApp: Renders initial UI (`App` -> `TerminalChat`)
    InkReactApp->>Terminal: Displays UI (header, empty chat, input box)
    User->>Terminal: Types message, presses Enter
    Terminal->>InkReactApp: Captures input (`TerminalChatInput`)
    InkReactApp->>AgentLoop: Sends user input via `submitInput` prop (in `TerminalChat`)
    Note over AgentLoop: Processes input, calls LLM...
AgentLoop->>InkReactApp: Sends back AI response via `onItem` prop (in `TerminalChat`) InkReactApp->>InkReactApp: Updates state (`items`), triggers re-render InkReactApp->>Terminal: Re-renders UI with new message (`MessageHistory`) ``` 1. You run `codex`. 2. The CLI process starts. 3. The React application (`App` -> `TerminalChat`) renders the initial UI using Ink components. Ink translates these components into terminal commands to draw the interface. 4. You type your message into the `<TerminalChatInput>` component. 5. When you press Enter, the input component's `onSubmit` handler is called. 6. `<TerminalChat>` receives this, packages it, and calls the `run` method on the [Agent Loop](03_agent_loop.md). 7. The Agent Loop processes the input (often calling an LLM). 8. When the Agent Loop has something to display (like the AI's text response), it calls the `onItem` callback function provided by `<TerminalChat>`. 9. `<TerminalChat>` receives the new message item and updates its `items` state using `setItems`. 10. React detects the state change and tells Ink to re-render the necessary components (like adding the new message to `<MessageHistory>`). 11. Ink updates the terminal display. The process for handling command confirmations is similar, involving the `getCommandConfirmation` and `submitConfirmation` callbacks between `<TerminalChat>` and the Agent Loop, rendering `<TerminalChatCommandReview>` in the UI when needed. ## Conclusion You've now seen how Codex uses React and the Ink library to build a fully interactive chat interface directly within your terminal. This "Terminal UI" layer acts as the visual front-end, displaying messages, capturing your input, and presenting choices like command approvals, all while coordinating with the core [Agent Loop](03_agent_loop.md) behind the scenes. But how exactly does that input box capture your keystrokes, handle multi-line editing, and manage command history? We'll explore that in the next chapter. 
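Before moving on, the callback wiring traced in "Under the Hood" can be sketched without React or Ink. This is only an illustration under stated assumptions: `FakeAgentLoop` and `FakeTerminalChat` are hypothetical stand-ins invented here, and only the shape of the flow (input goes in through `submitInput`, items come back through `onItem`, a state change triggers a redraw) mirrors the chapter.

```typescript
// Hypothetical stand-ins; only the callback wiring mirrors the chapter.
type ChatItem = { role: "user" | "assistant"; text: string };

// Plays the part of the Agent Loop: receives input, reports back via onItem.
class FakeAgentLoop {
  constructor(private onItem: (item: ChatItem) => void) {}

  run(input: ChatItem): void {
    this.onItem(input); // the user message is reported as an item too
    this.onItem({ role: "assistant", text: `You said: ${input.text}` });
  }
}

// Plays the part of the chat component: owns `items` and redraws on change.
class FakeTerminalChat {
  items: ChatItem[] = [];
  renders: string[] = []; // one entry per simulated re-render
  private agent = new FakeAgentLoop((item) => this.setItems([...this.items, item]));

  // Equivalent of the `submitInput` prop handed to the input component.
  submitInput(text: string): void {
    this.agent.run({ role: "user", text });
  }

  private setItems(next: ChatItem[]): void {
    this.items = next;
    this.renders.push(this.render()); // React/Ink would redraw here
  }

  render(): string {
    return this.items.map((i) => `${i.role}: ${i.text}`).join("\n");
  }
}

const chat = new FakeTerminalChat();
chat.submitInput("hello");
console.log(chat.render());
```

Each `onItem` call replaces the `items` array rather than mutating it, which is the same pattern `setItems((prev) => [...prev, newItem])` follows in a React component.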
Next up: [Input Handling (TextBuffer/Editor)](02_input_handling__textbuffer_editor_.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Codex/02_input_handling__textbuffer_editor_.md ================================================ --- layout: default title: "Input Handling (TextBuffer/Editor)" parent: "Codex" nav_order: 2 --- # Chapter 2: Input Handling (TextBuffer/Editor) In the [previous chapter](01_terminal_ui__ink_components_.md), we saw how Codex uses Ink and React to draw the chat interface in your terminal. We learned about components like `<TerminalChatInput>` and `<MultilineTextEditor>` that show an input box. But how does that input box *actually work*? ## Why a Fancy Input Box? Imagine you want Codex to write a small Python script. You might type something like this: ```text Write a python function that: 1. Takes a list of numbers. 2. Returns a new list containing only the even numbers. Make sure it handles empty lists gracefully. ``` Or maybe you're reviewing a command Codex proposed and want to give detailed feedback. A simple, single-line input field like your shell's basic prompt would be really awkward for this! You'd want to: * Write multiple lines easily. * Use arrow keys to move your cursor around to fix typos. * Maybe jump back a whole word (`Ctrl+LeftArrow`) or delete a word (`Ctrl+Backspace`). * Press `Up` or `Down` arrow to bring back previous messages you sent (history). * Perhaps even open the current text in your main code editor (like VS Code or Vim) for complex edits (`Ctrl+X`). This is where the **Input Handling** system comes in. It's like a mini text editor built right into the Codex chat interface, designed to make typing potentially complex prompts and messages much easier than a standard terminal input line. ## Key Idea: The `TextBuffer` The heart of this system is a class called `TextBuffer` (found in `text-buffer.ts`). 
Think of `TextBuffer` like the hidden document model behind a simple text editor (like Notepad or TextEdit): * **It holds the text:** It stores all the lines of text you've typed into the input box in an internal list (an array of strings called `lines`). * **It knows where the cursor is:** It keeps track of the cursor's position (which `row` and `column` it's on). * **It handles edits:** When you press keys like letters, numbers, Backspace, Delete, or Enter, the `TextBuffer` modifies the text and updates the cursor position accordingly. * **It manages scrolling:** If your text gets longer than the input box can display, the `TextBuffer` figures out which part of the text should be visible. The `MultilineTextEditor` React component we saw in Chapter 1 uses an instance of this `TextBuffer` internally to manage the state of the text being edited. ## How You Use It (Indirectly) You don't directly interact with `TextBuffer` yourself. You interact with the `<MultilineTextEditor>` component displayed by Ink. But understanding `TextBuffer` helps you see *how* the editor works. Let's look at a simplified view of how the `<TerminalChatInput>` component uses `<MultilineTextEditor>`: ```tsx // File: codex-cli/src/components/chat/terminal-chat-new-input.tsx (Simplified) import React, { useState, useCallback } from "react"; import { Box, Text, useInput } from "ink"; import MultilineTextEditor from "./multiline-editor"; // Our editor component // ... other imports export default function TerminalChatInput({ submitInput, active, /* ... */ }) { const [input, setInput] = useState(""); // Holds the current text in the editor state const [history, setHistory] = useState<Array<string>>([]); // Holds past submitted messages const [historyIndex, setHistoryIndex] = useState<number | null>(null); // Position in history (null = not navigating) // Used to force re-render editor when history changes text const [editorKey, setEditorKey] = useState(0); const editorRef = React.useRef(/* ... 
*/); // Handle to the editor // --- History Handling (Simplified) --- useInput((_input, key) => { // Check if Up/Down arrow pressed AND cursor is at top/bottom line const isAtTop = editorRef.current?.isCursorAtFirstRow(); const isAtBottom = editorRef.current?.isCursorAtLastRow(); if (key.upArrow && isAtTop && history.length > 0) { // Logic to go back in history const newIndex = historyIndex === null ? history.length - 1 : Math.max(0, historyIndex - 1); setHistoryIndex(newIndex); setInput(history[newIndex] ?? ""); // Set the text to the historical item setEditorKey(k => k + 1); // Force editor to re-mount with new text // ... save draft if needed ... } else if (key.downArrow && isAtBottom && historyIndex !== null) { // Logic to go forward in history or restore draft // ... similar logic using setInput, setHistoryIndex, setEditorKey ... } // Note: If not handling history, the key press falls through to MultilineTextEditor }, { isActive: active }); // --- Submission Handling --- const onSubmit = useCallback((textFromEditor: string) => { const trimmedText = textFromEditor.trim(); if (!trimmedText) return; // Ignore empty submissions // Add to history setHistory(prev => [...prev, textFromEditor]); setHistoryIndex(null); // Reset history navigation // Send the input to the Agent Loop! submitInput(/* ... create input item from trimmedText ... */); // Clear the input for the next message setInput(""); setEditorKey(k => k + 1); // Force editor reset }, [submitInput, setHistory /* ... */]); return ( <Box flexDirection="column"> {/* The actual editor component */} <MultilineTextEditor ref={editorRef} key={editorKey} // Changing the key re-mounts the editor initialText={input} // Text the editor starts with onChange={(text) => setInput(text)} // Update React state when text changes internally onSubmit={onSubmit} // Tell editor what to do on Enter height={8} // Example height /> <Text>ctrl+c exit | enter send | ↑↓ history | ctrl+x editor</Text> </Box> ); } ``` * **`initialText={input}`:** The `<MultilineTextEditor>` starts with the text held in the `input` state variable. This is how history navigation works: we change `input` and force the editor to re-mount by bumping `editorKey`. 
* **`onChange={(text) => setInput(text)}`:** Whenever the text *inside* the `MultilineTextEditor` (managed by its internal `TextBuffer`) changes, it calls this function. We update the `input` state variable in the parent component (`TerminalChatNewInput`) to keep track, though often the editor manages its own state primarily. * **`onSubmit={onSubmit}`:** When you press Enter (in a way that signifies submission, not just adding a newline), the `MultilineTextEditor` calls this `onSubmit` function, passing the final text content. This function then sends the message off to the [Agent Loop](03_agent_loop.md) and clears the input. * **History (`useInput`):** The parent component (`TerminalChatNewInput`) uses Ink's `useInput` hook to *intercept* the Up/Down arrow keys *before* they even reach the `MultilineTextEditor`. It checks if the cursor (using `editorRef.current?.isCursorAtFirstRow()`) is at the very top/bottom edge of the text. If so, it handles history navigation by changing the `input` state and forcing the editor to update using `setEditorKey`. If the cursor isn't at the edge, it lets the arrow key "fall through" to the `MultilineTextEditor`, which then just moves the cursor normally within the text via its internal `TextBuffer`. ## Under the Hood: Keystroke to Display Let's trace what happens when you type a character, say 'h', into the input box: ```mermaid sequenceDiagram participant User participant Terminal participant InkUI as Ink/React (MultilineTextEditor) participant TextBuffer participant AgentLoop as Agent Loop (Not involved) User->>Terminal: Presses 'h' key Terminal->>InkUI: Terminal sends key event to Ink InkUI->>InkUI: `useInput` hook captures 'h' InkUI->>TextBuffer: Calls `handleInput('h', { ... 
}, viewport)` TextBuffer->>TextBuffer: Finds current line ("") and cursor (0,0) TextBuffer->>TextBuffer: Calls `insert('h')` TextBuffer->>TextBuffer: Updates `lines` to `["h"]` TextBuffer->>TextBuffer: Updates `cursorCol` to 1 TextBuffer->>TextBuffer: Increments internal `version` TextBuffer-->>InkUI: Returns `true` (buffer was modified) InkUI->>InkUI: Triggers a React re-render because internal state changed InkUI->>TextBuffer: Calls `getVisibleLines(viewport)` -> returns `["h"]` InkUI->>TextBuffer: Calls `getCursor()` -> returns `[0, 1]` InkUI->>Terminal: Renders the updated text ("h") with cursor highlight ``` 1. **Keystroke:** You press the 'h' key. 2. **Capture:** Ink's `useInput` hook within `<MultilineTextEditor>` receives the key event. 3. **Delegate:** `<MultilineTextEditor>` calls the `handleInput` method on its internal `TextBuffer` instance, passing the input character ('h'), key modifier flags (like Shift, Ctrl - none in this case), and the current visible area size (viewport). 4. **Update State:** `TextBuffer.handleInput` determines it's a simple character insertion. It calls its internal `insert` method. 5. **`insert` Method:** * Gets the current line (e.g., `""`). * Splits the line at the cursor position (0). * Inserts the character: `""` + `'h'` + `""` -> `"h"`. * Updates the `lines` array: `["h"]`. * Updates the cursor column: `0` -> `1`. * Increments an internal version number to track changes. 6. **Signal Change:** `handleInput` returns `true` because the buffer was modified. 7. **Re-render:** The `<MultilineTextEditor>` component detects the change (either via the return value or its internal state update) and triggers a React re-render. 8. **Get Display Data:** During the render, `<MultilineTextEditor>` calls methods on the `TextBuffer` like: * `getVisibleLines()`: Gets the lines of text that should currently be visible based on scrolling. * `getCursor()`: Gets the current row and column of the cursor. 9. **Draw:** The component uses this information to render the text (`h`) in the terminal. 
It uses the cursor position to draw the cursor, often by rendering the character *at* the cursor position with an inverted background color (like `chalk.inverse(char)`). This same loop happens for every key press: Backspace calls `TextBuffer.backspace()`, arrow keys call `TextBuffer.move()`, Enter calls `TextBuffer.newline()` (or triggers `onSubmit`), etc. ## Diving into `TextBuffer` Code (Simplified) Let's peek inside `text-buffer.ts`: ```typescript // File: codex-cli/src/text-buffer.ts (Simplified) // Helper to check if a character is part of a "word" function isWordChar(ch: string | undefined): boolean { // Simplified: returns true if not whitespace or basic punctuation return ch !== undefined && !/[\s,.;!?]/.test(ch); } // Helper to get the length respecting multi-byte characters (like emoji) function cpLen(str: string): number { return Array.from(str).length; } // Helper to slice respecting multi-byte characters function cpSlice(str: string, start: number, end?: number): string { return Array.from(str).slice(start, end).join(''); } export default class TextBuffer { // --- Core State --- private lines: string[] = [""]; // The text, line by line private cursorRow = 0; // Cursor's current line number private cursorCol = 0; // Cursor's column (character index) on the line // ... scrollRow, scrollCol for viewport management ... private version = 0; // Increments on each change constructor(text = "") { this.lines = text.split("\n"); if (this.lines.length === 0) this.lines = [""]; // Start cursor at the end this.cursorRow = this.lines.length - 1; this.cursorCol = this.lineLen(this.cursorRow); } // --- Internal Helpers --- private line(r: number): string { return this.lines[r] ?? 
""; } private lineLen(r: number): number { return cpLen(this.line(r)); } private ensureCursorInRange(): void { /* Makes sure row/col are valid */ } // --- Public Accessors --- getCursor(): [number, number] { return [this.cursorRow, this.cursorCol]; } getText(): string { return this.lines.join("\n"); } getVisibleLines(/* viewport */): string[] { // ... calculate visible lines based on scrollRow/Col ... return this.lines; // Simplified: return all lines } // --- Editing Operations --- insert(ch: string): void { // ... handle potential newlines by calling insertStr ... const line = this.line(this.cursorRow); // Use cpSlice for multi-byte character safety this.lines[this.cursorRow] = cpSlice(line, 0, this.cursorCol) + ch + cpSlice(line, this.cursorCol); this.cursorCol += cpLen(ch); // Use cpLen this.version++; } newline(): void { const line = this.line(this.cursorRow); const before = cpSlice(line, 0, this.cursorCol); const after = cpSlice(line, this.cursorCol); this.lines[this.cursorRow] = before; // Keep text before cursor on current line this.lines.splice(this.cursorRow + 1, 0, after); // Insert text after cursor as new line this.cursorRow++; // Move cursor down this.cursorCol = 0; // Move cursor to start of new line this.version++; } backspace(): void { if (this.cursorCol > 0) { // If not at start of line const line = this.line(this.cursorRow); this.lines[this.cursorRow] = cpSlice(line, 0, this.cursorCol - 1) + cpSlice(line, this.cursorCol); this.cursorCol--; this.version++; } else if (this.cursorRow > 0) { // If at start of line (but not first line) // Merge with previous line const prevLine = this.line(this.cursorRow - 1); const currentLine = this.line(this.cursorRow); const newCol = this.lineLen(this.cursorRow - 1); // Cursor goes to end of merged line this.lines[this.cursorRow - 1] = prevLine + currentLine; // Combine lines this.lines.splice(this.cursorRow, 1); // Remove the now-empty current line this.cursorRow--; this.cursorCol = newCol; this.version++; } // 
Do nothing if at row 0, col 0 } move(dir: 'left' | 'right' | 'up' | 'down' | 'wordLeft' | 'wordRight' | 'home' | 'end'): void { switch (dir) { case 'left': if (this.cursorCol > 0) this.cursorCol--; else if (this.cursorRow > 0) { /* Move to end of prev line */ } break; case 'right': if (this.cursorCol < this.lineLen(this.cursorRow)) this.cursorCol++; else if (this.cursorRow < this.lines.length - 1) { /* Move to start of next line */ } break; case 'up': if (this.cursorRow > 0) { this.cursorRow--; // Try to maintain horizontal position (handle preferredCol logic) this.cursorCol = Math.min(this.cursorCol, this.lineLen(this.cursorRow)); } break; // ... other cases (down, home, end) ... case 'wordLeft': { // Scan backwards from cursorCol, skip whitespace, then skip word chars // Update this.cursorCol to the start of the word/whitespace run // ... implementation details ... break; } // ... wordRight ... } this.ensureCursorInRange(); } // --- High-Level Input Handler --- handleInput(input: string | undefined, key: Record<string, boolean>, /* viewport */): boolean { const beforeVersion = this.version; // Check key flags (key.leftArrow, key.backspace, key.ctrl, etc.) // and the `input` character itself. if (key.leftArrow && !key.ctrl && !key.meta) this.move('left'); else if (key.rightArrow && !key.ctrl && !key.meta) this.move('right'); else if (key.upArrow) this.move('up'); else if (key.downArrow) this.move('down'); else if ((key.ctrl || key.meta) && key.leftArrow) this.move('wordLeft'); // ... handle wordRight, home, end ... else if (key.backspace || input === '\x7f' /* DEL char */) this.backspace(); // ... handle delete, newline (Enter) ... else if (input && !key.ctrl && !key.meta) { // If it's a printable character (and not a special key combo) this.insert(input); } // ... ensure cursor visible based on viewport ... return this.version !== beforeVersion; // Return true if text changed } // --- External Editor --- async openInExternalEditor(): Promise<void> { // 1. 
Get editor from $VISUAL or $EDITOR env var (fallback to vi/notepad) // 2. Write this.getText() to a temporary file // 3. Use Node's `spawnSync` to run the editor command with the temp file path // (This blocks until the editor is closed) // 4. Read the content back from the temp file // 5. Update this.lines, this.cursorRow, this.cursorCol // 6. Clean up the temp file this.version++; } } ``` * The `lines` array holds the actual text content. * `cursorRow` and `cursorCol` track the insertion point. * Methods like `insert`, `backspace`, `newline`, and `move` directly manipulate `lines`, `cursorRow`, and `cursorCol`. They use helpers like `cpLen` and `cpSlice` to count and slice by code point, so characters that occupy more than one UTF-16 code unit (like emojis) are never split in half. * `handleInput` acts as the main entry point, deciding which specific editing operation to perform based on the key pressed. * `openInExternalEditor` handles the `Ctrl+X` magic by saving to a temp file, running the editor named by `$VISUAL` or `$EDITOR`, and reloading the content. ## Conclusion You've now seen how Codex provides a surprisingly powerful text editing experience right within your terminal. It goes far beyond a simple input line by using the `<MultilineTextEditor>` component, which relies heavily on the internal `TextBuffer` class. This class manages the text content, cursor position, and editing operations like insertion, deletion, multi-line handling, cursor navigation (including word jumps), and even integration with external editors. This allows you to compose complex prompts or provide detailed feedback without leaving the terminal interface. With the UI drawn and user input handled, what happens next? How does Codex take your input, think about it, and generate a response or decide to run a command? That's the job of the core logic loop. 
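To see why `TextBuffer` counts columns with `cpLen` and `cpSlice` instead of `String.prototype.length` and `slice`, here is a small standalone sketch. The two helpers are copied from the simplified listing above; `insertAt` is a hypothetical helper added for illustration that performs the same splice as `TextBuffer.insert` on a single line.

```typescript
// Count and slice by Unicode code point rather than by UTF-16 code unit.
function cpLen(str: string): number {
  return Array.from(str).length;
}

function cpSlice(str: string, start: number, end?: number): string {
  return Array.from(str).slice(start, end).join("");
}

// Hypothetical helper: insert `ch` at a code-point column within one line.
function insertAt(line: string, col: number, ch: string): string {
  return cpSlice(line, 0, col) + ch + cpSlice(line, col);
}

const sample = "a😀b";

console.log(sample.length); // 4: the emoji takes two UTF-16 code units
console.log(cpLen(sample)); // 3: but it is a single code point

// Cursor column 2 means "after the emoji" when counting code points.
console.log(insertAt(sample, 2, "X")); // a😀Xb

// The naive version cuts between the two halves of the emoji.
const broken = sample.slice(0, 2) + "X" + sample.slice(2);
console.log(broken === "a😀Xb"); // false
```

Code points are still not the same as what a user perceives as one character (some emoji are sequences of several code points), so this approach is a pragmatic middle ground rather than full grapheme handling.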
Next up: [Agent Loop](03_agent_loop.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Codex/03_agent_loop.md ================================================ --- layout: default title: "Agent Loop" parent: "Codex" nav_order: 3 --- # Chapter 3: Agent Loop In the [previous chapter](02_input_handling__textbuffer_editor_.md), we saw how Codex captures your commands and messages using a neat multi-line input editor. But once you hit Enter, where does that input *go*? What part of Codex actually understands your request, talks to the AI, and makes things happen? Meet the **Agent Loop**, the heart and brain of the Codex CLI. ## What's the Big Idea? Like a Helpful Assistant Imagine you have a very capable personal assistant. You give them a task, like "Find the latest sales report, summarize it, and email it to the team." Your assistant doesn't just magically do it all at once. They follow a process: 1. **Understand the Request:** Listen carefully to what you asked for. 2. **Gather Information:** Look for the sales report file. 3. **Perform Actions:** Read the report, write a summary. 4. **Ask for Confirmation (if needed):** "I've drafted the summary and email. Should I send it now?" 5. **Complete the Task:** Send the email after getting your 'yes'. 6. **Report Back:** Let you know the email has been sent. The **Agent Loop** in Codex acts much like this assistant. It's the central piece of logic that manages the entire conversation and workflow between you and the AI model (like OpenAI's GPT-4). Let's take our simple example: You type `codex "write a python script that prints hello world and run it"`. The Agent Loop is responsible for: 1. Taking your input ("write a python script..."). 2. Sending this request to the powerful AI model via the OpenAI API. 3. Getting the AI's response, which might include: * Text: "Okay, here's the script..." 
* A request to perform an action (a "function call"): "I need to run this command: `python -c 'print(\"hello world\")'`" 4. Showing you the text part of the response in the [Terminal UI](01_terminal_ui__ink_components_.md). 5. Handling the "function call": * Checking if it needs your permission based on the [Approval Policy](04_approval_policy___security.md). * If needed, asking you "Allow command?" via the UI. * If approved, actually running the command using the [Command Execution & Sandboxing](06_command_execution___sandboxing.md) system. 6. Getting the result of the command (the output "hello world"). 7. Sending that result back to the AI ("I ran the command, and it printed 'hello world'"). 8. Getting the AI's final response (maybe: "Great, the script ran successfully!"). 9. Showing you the final response. 10. Updating the conversation history with everything that happened. It's called a "loop" because it often goes back and forth between you, the AI, and tools (like the command line) until your request is fully handled. ## How It Works: The Conversation Cycle The Agent Loop orchestrates a cycle: ```mermaid graph TD A[User Input] --> B[Agent Loop] B --> C{Send to AI Model} C --> D[AI Response: Text or Tool Call] D --> B B --> E{Process Response} E -- Text --> F[Show Text in UI] E -- Tool Call --> G{Handle Tool Call} G --> H{Needs Approval?} H -- Yes --> I[Ask User via UI] I --> J{User Approves?} H -- No --> K[Execute Tool] J -- Yes --> K J -- No --> L[Report Denial to AI] K --> M[Get Tool Result] M --> B L --> B F --> N[Update History] M --> N L --> N N --> O[Ready for next Input/Step] ``` 1. **Input:** Gets input from you (via the [Input Handling](02_input_handling__textbuffer_editor_.md)). 2. **AI Call:** Sends the current conversation state (including your latest input and any previous steps) to the AI model (OpenAI API). 3. **Response Processing:** Receives the AI's response. 
This could be simple text, or it could include a request to use a tool (like running a shell command). This is covered more in [Response & Tool Call Handling](05_response___tool_call_handling.md). 4. **Tool Handling:** If the AI requested a tool: * Check the [Approval Policy](04_approval_policy___security.md). * Potentially ask you for confirmation via the [Terminal UI](01_terminal_ui__ink_components_.md). * If approved, execute the tool via [Command Execution & Sandboxing](06_command_execution___sandboxing.md). * Package the tool's result (e.g., command output) to send back to the AI in the next step. 5. **Update State:** Adds the AI's message and any tool results to the conversation history. Shows updates in the UI. 6. **Loop:** If the task isn't finished (e.g., because a tool was used and the AI needs to react to the result), it sends the updated conversation back to the AI (Step 2). If the task *is* finished, it waits for your next input. ## Using the Agent Loop (From the UI's Perspective) You don't directly interact with the `AgentLoop` class code when *using* Codex. Instead, the main UI component (`TerminalChat` in `terminal-chat.tsx`) creates and manages an `AgentLoop` instance. Think of the UI component holding the "remote control" for the Agent Loop assistant. ```tsx // File: codex-cli/src/components/chat/terminal-chat.tsx (Highly Simplified) import React, { useState, useEffect, useRef } from "react"; import { AgentLoop } from "../../utils/agent/agent-loop"; // ... other imports: UI components, config types ... export default function TerminalChat({ config, approvalPolicy, /* ... */ }) { const [items, setItems] = useState([]); // Holds conversation messages const [loading, setLoading] = useState(false); // Is the assistant busy? const [confirmationPrompt, setConfirmationPrompt] = useState(null); // Command to review? 
const agentRef = useRef(null); // Holds the assistant instance // Create the assistant when the component loads or config changes useEffect(() => { agentRef.current = new AgentLoop({ model: config.model, config: config, approvalPolicy: approvalPolicy, // --- Callbacks: How the assistant reports back --- onItem: (newItem) => { // When the assistant has a message/result setItems((prev) => [...prev, newItem]); // Add it to our chat history }, onLoading: (isLoading) => { // When the assistant starts/stops thinking setLoading(isLoading); }, getCommandConfirmation: async (command, /*...*/) => { // When the assistant needs approval // Show the command in the UI and wait for user's Yes/No const userDecision = await showConfirmationUI(command); return { review: userDecision /* ... */ }; }, // ... other callbacks like onLastResponseId ... }); return () => agentRef.current?.terminate(); // Clean up when done }, [config, approvalPolicy /* ... */]); // --- Function to send user input to the assistant --- const submitInputToAgent = (userInput) => { if (agentRef.current) { // Tell the assistant to process this input agentRef.current.run([userInput /* ... */]); } }; // --- UI Rendering --- return ( {/* Display 'items' using TerminalMessageHistory */} {/* Display input box (TerminalChatInput) or confirmationPrompt */} {/* Pass `submitInputToAgent` to the input box */} {/* Pass function to handle confirmation decision */} ); } ``` * **Initialization:** The UI creates an `AgentLoop`, giving it the necessary configuration ([Configuration Management](07_configuration_management.md)) and crucial **callback functions**. These callbacks are how the Agent Loop communicates back to the UI: * `onItem`: "Here's a new message (from user, AI, or tool) to display." * `onLoading`: "I'm starting/stopping my work." * `getCommandConfirmation`: "I need to run this command. Please ask the user and tell me their decision." 
* **Running:** When you submit input via the `<TerminalChatInput>`, the UI calls the `agentRef.current.run(...)` method, handing off your request to the Agent Loop. * **Updates:** The Agent Loop does its work, calling the `onItem` and `onLoading` callbacks whenever something changes. The UI listens to these callbacks and updates the display accordingly (setting state variables like `items` and `loading`, which causes React to re-render). * **Confirmation:** If the Agent Loop needs approval, it calls `getCommandConfirmation`. The UI pauses, shows the command review prompt, waits for your decision, and then returns the decision back to the Agent Loop, which then proceeds or stops based on your choice. ## Under the Hood: A Step-by-Step Flow Let's trace our "hello world" example again, focusing on the interactions: ```mermaid sequenceDiagram participant User participant InkUI as Terminal UI (Ink) participant AgentLoop participant OpenAI participant CmdExec as Command Execution User->>InkUI: Types "write & run hello world", presses Enter InkUI->>AgentLoop: Calls `run(["write & run..."])` AgentLoop->>AgentLoop: Sets loading=true (calls `onLoading(true)`) InkUI->>User: Shows loading indicator AgentLoop->>OpenAI: Sends request: ["write & run..."] OpenAI-->>AgentLoop: Streams response: [Text: "Okay, try:", ToolCall: `shell(...)`] AgentLoop->>InkUI: Calls `onItem(Text: "Okay, try:")` InkUI->>User: Displays "Okay, try:" AgentLoop->>AgentLoop: Processes ToolCall `shell(...)` Note over AgentLoop: Checks Approval Policy AgentLoop->>InkUI: Calls `getCommandConfirmation(["python", "-c", "..."])` InkUI->>User: Displays "Allow command: python -c '...'?" 
[Yes/No] User->>InkUI: Clicks/Types 'Yes' InkUI-->>AgentLoop: Returns confirmation result ({ review: YES }) AgentLoop->>CmdExec: Executes `python -c 'print("hello world")'` CmdExec-->>AgentLoop: Returns result (stdout: "hello world", exit code: 0) AgentLoop->>AgentLoop: Creates `function_call_output` item AgentLoop->>OpenAI: Sends request: [..., ToolCall: `shell(...)`, Output: "hello world"] OpenAI-->>AgentLoop: Streams response: [Text: "Command ran successfully!"] AgentLoop->>InkUI: Calls `onItem(Text: "Command ran...")` InkUI->>User: Displays "Command ran successfully!" AgentLoop->>AgentLoop: Sets loading=false (calls `onLoading(false)`) InkUI->>User: Hides loading indicator, shows input prompt ``` This diagram shows the back-and-forth orchestration performed by the Agent Loop, coordinating between the UI, the AI model, and the command execution system. ## Inside `agent-loop.ts` The core logic lives in `codex-cli/src/utils/agent/agent-loop.ts`. Let's peek at a *very* simplified structure: ```typescript // File: codex-cli/src/utils/agent/agent-loop.ts (Simplified) import OpenAI from "openai"; // ... other imports: types for config, responses, approval ... import { handleExecCommand } from "./handle-exec-command"; // For tool calls export class AgentLoop { private oai: OpenAI; // The OpenAI client instance private model: string; private config: AppConfig; private approvalPolicy: ApprovalPolicy; // Callbacks provided by the UI: private onItem: (item: ResponseItem) => void; private onLoading: (loading: boolean) => void; private getCommandConfirmation: (/*...*/) => Promise; // ... other state like current stream, cancellation flags ... constructor({ model, config, approvalPolicy, onItem, onLoading, getCommandConfirmation, /*...*/ }: AgentLoopParams) { this.model = model; this.config = config; this.approvalPolicy = approvalPolicy; this.onItem = onItem; this.onLoading = onLoading; this.getCommandConfirmation = getCommandConfirmation; this.oai = new OpenAI({ /* ... 
API key, base URL ... */ }); // ... initialize other state ... } // The main method called by the UI public async run(input: Array<ResponseInputItem>, previousResponseId: string = ""): Promise<void> { this.onLoading(true); // Signal start let turnInput = input; // Input for this step of the loop let lastResponseId = previousResponseId; try { // Keep looping as long as there's input (initially user msg, later tool results) while (turnInput.length > 0) { // 1. Send current input history to OpenAI API const stream = await this.oai.responses.create({ model: this.model, input: turnInput, // Includes user message or tool results previous_response_id: lastResponseId || undefined, stream: true, // ... other parameters like instructions, tools ... }); turnInput = []; // Clear input for the next loop iteration // 2. Process the stream of events from OpenAI for await (const event of stream) { if (event.type === "response.output_item.done") { const item = event.item; // Could be text, function_call, etc. this.onItem(item as ResponseItem); // Send item to UI to display } if (event.type === "response.completed") { lastResponseId = event.response.id; // Remember the ID for the next call // Check the final output for tool calls for (const item of event.response.output) { if (item.type === "function_call") { // Handle the tool call (ask for approval, execute) // This might add a 'function_call_output' to `turnInput` const toolResults = await this.handleFunctionCall(item); turnInput.push(...toolResults); } } } // ... handle other event types ... } // End stream processing } // End while loop (no more input for this turn) } catch (error) { // ... Handle errors (network issues, API errors etc.) ... this.onItem(/* Create system error message */); } finally { this.onLoading(false); // Signal end } } // Helper to handle tool/function calls private async handleFunctionCall(item: ResponseFunctionToolCall): Promise<Array<ResponseInputItem>> { // ... Parse arguments from 'item' ... const args = /* ... parse item.arguments ... 
*/; let outputText = "Error: Unknown function"; let metadata = {}; if (item.name === "shell") { // Example: handle shell commands // This uses the approval policy and getCommandConfirmation callback! const result = await handleExecCommand( args, this.config, this.approvalPolicy, this.getCommandConfirmation, /* ... cancellation signal ... */ ); outputText = result.outputText; metadata = result.metadata; } // ... handle other function names ... // Format the result to send back to OpenAI in the next turn const outputItem: ResponseInputItem.FunctionCallOutput = { type: "function_call_output", call_id: item.call_id, // Link to the specific function call output: JSON.stringify({ output: outputText, metadata }), }; return [outputItem]; // This goes into `turnInput` for the next loop } // ... other methods like cancel(), terminate() ... } ``` * **Constructor:** Sets up the connection to OpenAI and stores the configuration and callbacks passed in by the UI. * **`run()`:** This is the main engine. * It signals loading starts (`onLoading(true)`). * It enters a `while` loop that continues as long as there's something to send to the AI (initially the user's message, later potentially the results from tools). * Inside the loop, it calls `this.oai.responses.create()` to talk to the AI model, sending the current conversation turn. * It processes the `stream` of events coming back from the AI. * For each piece of content (`response.output_item.done`), it calls `onItem` to show it in the UI. * When the AI's turn is complete (`response.completed`), it checks if the AI asked to use any tools (`function_call`). * If a tool call is found, it calls `handleFunctionCall`. * **`handleFunctionCall()`:** * Parses the details of the tool request (e.g., the command arguments). 
* Uses `handleExecCommand` (which contains logic related to [Approval Policy](04_approval_policy___security.md) and [Command Execution](06_command_execution___sandboxing.md)) to potentially run the command, using the `getCommandConfirmation` callback if needed. * Formats the result of the tool execution (e.g., command output) into a specific `function_call_output` message. * Returns this output message. The `run` method adds this to `turnInput`, so the *next* iteration of the `while` loop will send this result back to the AI, letting it know what happened. * **Finally:** Once the `while` loop finishes (meaning the AI didn't request any more tools in its last response), it signals loading is done (`onLoading(false)`). This loop ensures that the conversation flows logically, handling text, tool requests, user approvals, and tool results in a structured way. ## Conclusion The Agent Loop is the central orchestrator within Codex. It acts like a diligent assistant, taking your requests, interacting with the powerful AI model, managing tools like shell commands, ensuring safety through approvals, and keeping the conversation state updated. It connects the [Terminal UI](01_terminal_ui__ink_components_.md) where you interact, the [Input Handling](02_input_handling__textbuffer_editor_.md) that captures your text, the AI model that provides intelligence, and the systems that actually execute actions ([Command Execution & Sandboxing](06_command_execution___sandboxing.md)). Understanding the Agent Loop helps you see how Codex manages the complex back-and-forth required to turn your natural language requests into concrete actions. But when the Agent Loop wants to run a command suggested by the AI, how does Codex decide whether to ask for your permission first? That crucial safety mechanism is the topic of our next chapter. 
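Before moving on, the loop described in this chapter can be condensed into a small runnable sketch. It is not Codex's implementation: the OpenAI call is replaced by a stub, and `fakeModel`, `runTool`, `runLoop`, and the item shapes are hypothetical names invented for illustration. Only the control flow (send input, show items, turn tool calls into `function_call_output` items, loop until no input remains) mirrors `run()`.

```typescript
// Hypothetical, simplified item shapes (not the real OpenAI SDK types).
type InputItem =
  | { type: "message"; text: string }
  | { type: "function_call_output"; call_id: string; output: string };

type OutputItem =
  | { type: "message"; text: string }
  | { type: "function_call"; call_id: string; name: string; arguments: string };

// Stub model: requests one shell command, then reports success once it sees a tool result.
function fakeModel(input: InputItem[]): OutputItem[] {
  const sawToolOutput = input.some((i) => i.type === "function_call_output");
  if (sawToolOutput) {
    return [{ type: "message", text: "Command ran successfully!" }];
  }
  return [
    {
      type: "function_call",
      call_id: "call_1",
      name: "shell",
      arguments: JSON.stringify({ command: ["echo", "hello world"] }),
    },
  ];
}

// Stub tool runner: pretends to execute `echo` and returns what it would print.
function runTool(call: { call_id: string; arguments: string }): InputItem {
  const { command } = JSON.parse(call.arguments) as { command: string[] };
  const output = command.slice(1).join(" ");
  return { type: "function_call_output", call_id: call.call_id, output };
}

// Same shape as run(): keep calling the model while there is input to send.
function runLoop(userText: string, onItem: (item: OutputItem) => void): number {
  let turnInput: InputItem[] = [{ type: "message", text: userText }];
  let modelCalls = 0;
  while (turnInput.length > 0) {
    const output = fakeModel(turnInput);
    modelCalls += 1;
    turnInput = []; // cleared; only tool results re-populate it
    for (const item of output) {
      onItem(item); // the UI would render this item
      if (item.type === "function_call") {
        turnInput.push(runTool(item)); // result goes back to the model next turn
      }
    }
  }
  return modelCalls;
}

const shown: string[] = [];
const calls = runLoop("Say hello", (item) => shown.push(item.type));
```

The loop ends on its own because the second model reply contains no tool call, so `turnInput` stays empty. That is the same termination condition the `while` loop in `run()` relies on.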
Next up: [Approval Policy & Security](04_approval_policy___security.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Codex/04_approval_policy___security.md ================================================ --- layout: default title: "Approval Policy & Security" parent: "Codex" nav_order: 4 --- # Chapter 4: Approval Policy & Security In the [previous chapter](03_agent_loop.md), we saw how the **Agent Loop** acts like Codex's brain, talking to the AI and figuring out what steps to take. Sometimes, the AI might suggest actions that could change things on your computer, like modifying a file or running a command in your terminal (e.g., `git commit`, `npm install`, or even `rm important_file.txt`!). This sounds powerful, but also a little scary, right? What if the AI misunderstands and suggests deleting the wrong file? We need a way to control how much power Codex has. That's exactly what the **Approval Policy & Security** system does. It's like a security guard standing between the AI's suggestions and your actual computer. ## What's the Big Idea? The Security Guard Imagine you're visiting a secure building. Depending on your pass, you have different levels of access: * **Guest Pass (`suggest` mode):** You can look around (read files), but if you want to open a door (modify a file) or use special equipment (run a command), you need to ask the guard for permission every single time. * **Employee Badge (`auto-edit` mode):** You can open regular office doors (modify files in the project) without asking each time, but you still need permission for restricted areas like the server room (running commands). * **Full Access Badge (`full-auto` mode):** You can go almost anywhere (modify files, run commands), but for potentially sensitive actions (like running commands), the guard might escort you to a special monitored room (a "sandbox") to ensure safety. 
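As a rough sketch, the three passes can be written down as a lookup table. Only the mode names come from Codex; the `Capabilities` shape, the `PASSES` table, and `needsPrompt` are hypothetical names chosen for illustration, and the table deliberately ignores special cases covered later in this chapter.

```typescript
type ApprovalMode = "suggest" | "auto-edit" | "full-auto";

// Hypothetical summary of what each mode may do without prompting the user.
interface Capabilities {
  editFilesWithoutAsking: boolean;
  runCommandsWithoutAsking: boolean;
  commandsSandboxed: boolean; // the "escort" in the analogy
}

const PASSES: Record<ApprovalMode, Capabilities> = {
  // Guest Pass: ask before every edit and every command.
  "suggest": { editFilesWithoutAsking: false, runCommandsWithoutAsking: false, commandsSandboxed: false },
  // Employee Badge: edits are automatic, commands still need approval.
  "auto-edit": { editFilesWithoutAsking: true, runCommandsWithoutAsking: false, commandsSandboxed: false },
  // Full Access Badge: both are automatic, but commands run under escort.
  "full-auto": { editFilesWithoutAsking: true, runCommandsWithoutAsking: true, commandsSandboxed: true },
};

// Returns true when the user has to be asked before the action proceeds.
function needsPrompt(mode: ApprovalMode, action: "edit" | "command"): boolean {
  const caps = PASSES[mode];
  return action === "edit" ? !caps.editFilesWithoutAsking : !caps.runCommandsWithoutAsking;
}
```

For example, `needsPrompt("auto-edit", "command")` is `true`, while `needsPrompt("auto-edit", "edit")` is `false`.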
The Approval Policy in Codex works just like these passes. It lets *you* choose how much autonomy Codex has when it suggests potentially risky actions. ## Key Concepts: The Approval Modes Codex offers different levels of autonomy, which you can usually set with a command-line flag like `--approval-mode` or when you first configure it. These are the main modes: 1. **`suggest` (Default):** * **What it is:** The most cautious mode. Like the Guest Pass. * **What it does:** Codex can read files to understand your project, but before it *modifies* any file or *runs* any command, it will always stop and ask for your explicit permission through the [Terminal UI](01_terminal_ui__ink_components_.md). * **Use when:** You want maximum control and want to review every single change or command. 2. **`auto-edit`:** * **What it is:** Allows automatic file edits, but still requires approval for commands. Like the Employee Badge. * **What it does:** Codex can automatically apply changes (patches) to files within your project directory. However, if it wants to run a shell command (like `npm install`, `git commit`, `python script.py`), it will still stop and ask for your permission. * **Use when:** You trust the AI to make code changes but still want to manually approve any commands it tries to run. 3. **`full-auto`:** * **What it is:** The most autonomous mode, allowing file edits and command execution, but with safeguards. Like the Full Access Badge with escort. * **What it does:** Codex can automatically apply file changes *and* run shell commands without asking you first. Crucially, to prevent accidental damage, commands run in this mode are typically executed inside a **sandbox** – a restricted environment that limits what the command can do (e.g., blocking network access, limiting file access to the project directory). We'll learn more about this in the [Command Execution & Sandboxing](06_command_execution___sandboxing.md) chapter. 
* **Use when:** You want Codex to work as independently as possible, understanding that potentially risky commands are run with safety restrictions. ## How it Works in Practice When the [Agent Loop](03_agent_loop.md) receives a suggestion from the AI to perform an action (like applying a patch or running a shell command), it doesn't just blindly execute it. Instead, it checks the current Approval Policy you've set. ```mermaid sequenceDiagram participant AgentLoop as Agent Loop participant ApprovalCheck as Approval Policy Check participant UserUI as Terminal UI participant CmdExec as Command Execution AgentLoop->>AgentLoop: AI suggests action (e.g., run `npm install`) AgentLoop->>ApprovalCheck: Check action against policy (`auto-edit`) ApprovalCheck->>ApprovalCheck: Action is `npm install` (command) ApprovalCheck->>ApprovalCheck: Policy is `auto-edit` (commands need approval) ApprovalCheck-->>AgentLoop: Decision: `ask-user` AgentLoop->>UserUI: Request confirmation for `npm install` UserUI->>UserUI: Display "Allow command `npm install`? [Y/n]" UserUI-->>AgentLoop: User response (e.g., Yes) AgentLoop->>CmdExec: Execute `npm install` ``` 1. **Suggestion:** The AI tells the Agent Loop it wants to run `npm install`. 2. **Check Policy:** The Agent Loop asks the Approval Policy system: "The AI wants to run `npm install`. The user set the policy to `auto-edit`. Is this okay?" 3. **Decision:** The Approval Policy system checks its rules: * The action is a shell command. * The policy is `auto-edit`. * Rule: In `auto-edit` mode, shell commands require user approval. * Result: The decision is `ask-user`. 4. **Ask User:** The Agent Loop receives the `ask-user` decision and uses the `getCommandConfirmation` callback (provided by the [Terminal UI](01_terminal_ui__ink_components_.md)) to display the prompt to you. 5. **User Response:** You see the prompt and respond (e.g., 'Yes'). 6. 
**Execute (if approved):** The Agent Loop receives your 'Yes' and proceeds to execute the command, potentially using the [Command Execution & Sandboxing](06_command_execution___sandboxing.md) system. If the policy had been `full-auto`, the decision in Step 3 might have been `auto-approve` (with `runInSandbox: true`), and the Agent Loop would have skipped asking you (Steps 4 & 5) and gone straight to execution (Step 6), but inside the sandbox. If the action had been a file patch and the policy `auto-edit` or `full-auto`, the decision might also have been `auto-approve` (after checking that the patch only touches allowed paths), skipping the user prompt. ## Under the Hood: The `approvals.ts` Logic The core logic for making these decisions lives in `codex-cli/src/approvals.ts`. A key function here is `canAutoApprove`. ```typescript // File: codex-cli/src/approvals.ts (Simplified) // Represents the different approval modes export type ApprovalPolicy = "suggest" | "auto-edit" | "full-auto"; // Represents the outcome of the safety check export type SafetyAssessment = | { type: "auto-approve"; runInSandbox: boolean; reason: string; /*...*/ } | { type: "ask-user"; applyPatch?: ApplyPatchCommand } | { type: "reject"; reason: string }; // Input for apply_patch commands export type ApplyPatchCommand = { patch: string; }; /** * Checks if a command can be run automatically based on the policy. */ export function canAutoApprove( command: ReadonlyArray<string>, // e.g., ["git", "status"] or ["apply_patch", "..."] policy: ApprovalPolicy, writableRoots: ReadonlyArray<string>, // Allowed directories for edits // ... env ...
): SafetyAssessment { // --- Special case: apply_patch --- if (command[0] === "apply_patch") { // Check if policy allows auto-editing and if patch only affects allowed files const applyPatchArg = command[1] as string; const patchDetails = { patch: applyPatchArg }; if (policy === "suggest") return { type: "ask-user", applyPatch: patchDetails }; if (isWritePatchConstrainedToWritablePaths(applyPatchArg, writableRoots)) { return { type: "auto-approve", runInSandbox: false, reason: "Patch affects allowed files", /*...*/ }; } // If policy is auto-edit but patch affects disallowed files, ask user. // If policy is full-auto, still approve, but run in the sandbox because the patch touches paths outside the writable roots. return policy === "full-auto" ? { type: "auto-approve", runInSandbox: true, reason: "Full auto mode", /*...*/ } : { type: "ask-user", applyPatch: patchDetails }; } // --- Check for known safe, read-only commands --- const knownSafe = isSafeCommand(command); // Checks things like "ls", "pwd", "git status" if (knownSafe != null) { return { type: "auto-approve", runInSandbox: false, reason: knownSafe.reason, /*...*/ }; } // --- Handle shell commands (like "bash -lc 'npm install'") --- // (Simplified: assumes any other command needs policy check) // --- Default: Check policy for general commands --- if (policy === "full-auto") { return { type: "auto-approve", runInSandbox: true, reason: "Full auto mode", /*...*/ }; } else { // 'suggest' and 'auto-edit' require asking for commands return { type: "ask-user" }; } } // Helper to check if a command is known to be safe (read-only) function isSafeCommand(command: ReadonlyArray<string>): { reason: string, group: string } | null { // Match on the first token, or on the first two for subcommands such as "git status" const [cmd, sub] = command; const name = cmd === "git" && sub != null ? `git ${sub}` : cmd; if (name != null && ["ls", "pwd", "cat", "git status", "git diff", /*...*/].includes(name)) { return { reason: `Safe read-only command: ${name}`, group: "Reading" }; } return null; } // Helper (simplified) to check if patch affects allowed paths function isWritePatchConstrainedToWritablePaths( patch: string, writableRoots: ReadonlyArray<string>
): boolean { // ... logic to parse patch and check affected file paths ... // ... return true if all paths are within writableRoots ... return true; // Simplified for example } ``` * **Inputs:** `canAutoApprove` takes the command the AI wants to run (as an array of strings, like `["npm", "install"]`), the current `ApprovalPolicy` (`suggest`, `auto-edit`, or `full-auto`), and a list of directories where file edits are allowed (`writableRoots`, usually just your project's main folder). * **Checks:** It first handles special cases like `apply_patch` (checking the policy and file paths) and known safe, read-only commands using `isSafeCommand`. * **Policy Decision:** For other commands, it primarily relies on the policy: * If `full-auto`, it returns `auto-approve` but sets `runInSandbox` to `true`. * If `suggest` or `auto-edit`, it returns `ask-user`. * **Output:** It returns a `SafetyAssessment` object telling the [Agent Loop](03_agent_loop.md) what to do: `auto-approve` (and whether sandboxing is needed), `ask-user`, or in rare cases, `reject` (if the command is fundamentally invalid). This decision is then used back in the Agent Loop, often within a function like `handleExecCommand` (in `handle-exec-command.ts`), which we touched on in the previous chapter. ```typescript // File: codex-cli/src/utils/agent/handle-exec-command.ts (Simplified snippet) import { canAutoApprove } from "../../approvals.js"; import { ReviewDecision } from "./review.js"; // ... other imports ... export async function handleExecCommand( args: ExecInput, // Contains the command array `cmd` config: AppConfig, policy: ApprovalPolicy, getCommandConfirmation: (/*...*/) => Promise<{ review: ReviewDecision }>, // UI callback // ... abortSignal ... ): Promise<{ outputText: string; metadata: Record<string, unknown> }> { // *** Check the approval policy first!
*** const safety = canAutoApprove(args.cmd, policy, [process.cwd()]); let runInSandbox: boolean; switch (safety.type) { case "ask-user": { // Policy requires asking the user const { review: decision } = await getCommandConfirmation(args.cmd, safety.applyPatch); if (decision !== ReviewDecision.YES && decision !== ReviewDecision.ALWAYS) { // User said No or provided feedback to stop return { outputText: "aborted", metadata: { /*...*/ } }; } // User approved! Proceed without sandbox (unless policy changes later). runInSandbox = false; break; } case "auto-approve": { // Policy allows auto-approval runInSandbox = safety.runInSandbox; // Respect sandbox flag from canAutoApprove break; } case "reject": { // Policy outright rejected the command return { outputText: "aborted", metadata: { reason: safety.reason } }; } } // *** If approved (either automatically or by user), execute the command *** const summary = await execCommand(args, safety.applyPatch, runInSandbox, /*...*/); // ... handle results ... return convertSummaryToResult(summary); } ``` This shows how `canAutoApprove` is called first. If it returns `ask-user`, the `getCommandConfirmation` callback (which triggers the UI prompt) is invoked. Only if the assessment is `auto-approve` or the user explicitly approves does the code proceed to actually execute the command using `execCommand`, passing the `runInSandbox` flag determined by the policy check. ## Conclusion The Approval Policy & Security system is Codex's safety net. It puts you in control, letting you choose the balance between letting the AI work autonomously and requiring manual confirmation for actions that could affect your system. By understanding the `suggest`, `auto-edit`, and `full-auto` modes, you can configure Codex to operate in a way that matches your comfort level with automation and risk. 
This system works hand-in-hand with the [Agent Loop](03_agent_loop.md) to intercept potentially risky actions and enforce the rules you've set, sometimes using sandboxing (as we'll see later) for an extra layer of protection. Now that we know how Codex decides *whether* to perform an action, how does it actually understand the AI's response, especially when the AI wants to use a tool like running a command or applying a patch? Next up: [Response & Tool Call Handling](05_response___tool_call_handling.md) --- Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge) ================================================ FILE: docs/Codex/05_response___tool_call_handling.md ================================================ --- layout: default title: "Response & Tool Call Handling" parent: "Codex" nav_order: 5 --- # Chapter 5: Response & Tool Call Handling In the [previous chapter](04_approval_policy___security.md), we learned how Codex decides *if* it's allowed to perform an action suggested by the AI, acting like a security guard based on the rules you set. But how does Codex understand the AI's response in the first place, especially when the AI wants to do something specific, like run a command or change a file? That's where **Response & Tool Call Handling** comes in. Think of this part of Codex as its "ears" and "hands." It listens carefully to the instructions coming back from the AI model (the "response") and, if the AI asks to perform an action (a "tool call"), it figures out *exactly* what the AI wants to do (like which command to run or what file change to make) and gets ready to do it. ## What's the Big Idea? Listening to the AI Assistant Imagine you ask your super-smart assistant (the AI model) to do something like: `codex "What's the status of my project? Use git status."` The AI doesn't just send back plain text like "Okay, I'll run it." 
Instead, it sends back a more structured message, almost like filling out a form: * **Text Part:** "Okay, I will check the status of your project." * **Action Part (Tool Call):** * **Tool Name:** `shell` (meaning: use the command line) * **Arguments:** `["git", "status"]` (meaning: the specific command to run) Codex needs to understand this structured response. It needs to: 1. Recognize the plain text part and show it to you in the [Terminal UI](01_terminal_ui__ink_components_.md). 2. See the "Action Part" (the Tool Call) and understand: * Which tool the AI wants to use (`shell`). * What specific details (arguments) are needed for that tool (`git status`). This system is crucial because it translates the AI's intent into something Codex can actually act upon. ## Key Concepts 1. **Structured Responses:** The OpenAI API doesn't just return a single block of text. It sends back data structured often like JSON. This allows the AI to clearly separate regular conversation text from requests to perform actions. ```json // Simplified idea of an AI response { "id": "response_123", "output": [ { "type": "message", // A regular text message "role": "assistant", "content": [{ "type": "output_text", "text": "Okay, checking the status..." }] }, { "type": "function_call", // A request to use a tool! "name": "shell", "arguments": "{\"command\": [\"git\", \"status\"]}", // Details for the tool "call_id": "call_abc" } ] // ... other info ... } ``` This structure makes it easy for Codex to programmatically understand the different parts of the AI's message. 2. **Tool Calls (Function Calls):** When the AI wants to interact with the outside world (run a command, edit a file), it uses a special type of message in the response, often called a "function call" or "tool call". In Codex, common tool names are: * `shell`: Execute a command in the terminal. * `apply_patch`: Modify a file using a specific format called a "patch". 3. 
**Arguments:** The tool call includes the necessary details, called "arguments," usually formatted as a JSON string. * For the `shell` tool, the arguments specify the command to run (e.g., `{"command": ["git", "status"]}`). * For the `apply_patch` tool, the arguments contain the patch text describing the file changes (e.g., `{"patch": "*** Begin Patch..."}`). ## How It Works: Decoding the AI's Message When the [Agent Loop](03_agent_loop.md) receives a response from the OpenAI API, it goes through these steps: ```mermaid sequenceDiagram participant OpenAI participant AgentLoop as Agent Loop participant Parser as Response Parser participant UI as Terminal UI participant Approval as Approval Check OpenAI-->>AgentLoop: Sends structured response (Text + Tool Call) AgentLoop->>Parser: Passes raw response data Parser->>Parser: Extracts Text part ("Okay...") Parser-->>AgentLoop: Returns extracted Text AgentLoop->>UI: Sends Text to display ("onItem" callback) Parser->>Parser: Extracts Tool Call part (shell, ["git", "status"]) Parser-->>AgentLoop: Returns Tool Name ("shell") & Arguments (["git", "status"]) AgentLoop->>Approval: Sends Tool details for policy check Note over Approval: Next step: Chapter 4/6 ``` 1. **Receive Response:** The [Agent Loop](03_agent_loop.md) gets the structured response data from the OpenAI API. 2. **Parse:** It uses helper functions (often found in `utils/parsers.ts`) to examine the response structure. 3. **Extract Text:** If there's a regular text message (`"type": "message"`), it's extracted and sent to the [Terminal UI](01_terminal_ui__ink_components_.md) via the `onItem` callback to be displayed. 4. **Extract Tool Call:** If there's a tool call (`"type": "function_call"`): * The **tool name** (e.g., `shell`) is identified. * The **arguments** string is extracted. * The arguments string (which is often JSON) is parsed to get the actual details (e.g., the `command` array `["git", "status"]`). 5. 
**Prepare for Action:** The Agent Loop now knows the specific tool and its arguments. It packages this information (tool name + parsed arguments) and prepares for the next stage: checking the [Approval Policy & Security](04_approval_policy___security.md) and, if approved, proceeding to [Command Execution & Sandboxing](06_command_execution___sandboxing.md). ## Under the Hood: Parsing the Details Let's look at simplified code snippets showing how this parsing happens. ### In the Agent Loop (`agent-loop.ts`) The `AgentLoop` processes events streamed from the OpenAI API. When a complete response arrives or a specific tool call item is identified, it needs handling. ```typescript // File: codex-cli/src/utils/agent/agent-loop.ts (Simplified) // Inside the loop processing OpenAI stream events... for await (const event of stream) { if (event.type === "response.output_item.done") { const item = event.item; // Could be text, function_call, etc. this.onItem(item as ResponseItem); // Send to UI // If it's a tool call, mark it for later processing if (item.type === "function_call") { // Store item.call_id or item details // to handle after the stream finishes } } if (event.type === "response.completed") { // Process the full response output once the stream is done for (const item of event.response.output) { if (item.type === "function_call") { // *** This is where we handle the tool call! *** // Calls a helper function like handleFunctionCall const toolResults = await this.handleFunctionCall(item); // Prepare results to potentially send back to AI turnInput.push(...toolResults); } } lastResponseId = event.response.id; } // ... other event types ... 
} // Helper function to process the tool call details private async handleFunctionCall(item: ResponseFunctionToolCall): Promise<Array<ResponseInputItem>> { const name = item.name; // e.g., "shell" const rawArguments = item.arguments; // e.g., "{\"command\": [\"git\", \"status\"]}" const callId = item.call_id; // *** Use a parser to get structured arguments *** const args = parseToolCallArguments(rawArguments ?? "{}"); // From parsers.ts if (args == null) { // Handle error: arguments couldn't be parsed return [/* error output item */]; } let outputText = `Error: Unknown function ${name}`; let metadata = {}; // Check which tool was called if (name === "shell") { // *** Prepare for execution *** // Call handleExecCommand, which checks approval and runs the command const result = await handleExecCommand( args, // Contains { cmd: ["git", "status"], ... } this.config, this.approvalPolicy, this.getCommandConfirmation, // Function to ask user via UI /* ... cancellation signal ... */ ); outputText = result.outputText; metadata = result.metadata; } else if (name === "apply_patch") { // Similar logic, potentially using execApplyPatch after approval check // It would parse args.patch using logic from parse-apply-patch.ts } // ... other tools ... // Create the result message to send back to the AI const outputItem: ResponseInputItem.FunctionCallOutput = { type: "function_call_output", call_id: callId, output: JSON.stringify({ output: outputText, metadata }), }; return [outputItem]; } ``` * The loop iterates through the response `output` items. * If an item is a `function_call`, the `handleFunctionCall` helper is called. * `handleFunctionCall` extracts the `name` and `arguments`. * It crucially calls `parseToolCallArguments` (from `utils/parsers.ts`) to turn the JSON string `arguments` into a usable object. * Based on the `name` (`shell`, `apply_patch`), it calls the appropriate execution handler (like `handleExecCommand`), passing the parsed arguments.
This handler coordinates with the [Approval Policy & Security](04_approval_policy___security.md) and [Command Execution & Sandboxing](06_command_execution___sandboxing.md) systems. ### In the Parsers (`parsers.ts`) This file contains helpers to decode the tool call details. ```typescript // File: codex-cli/src/utils/parsers.ts (Simplified) import { formatCommandForDisplay } from "src/format-command.js"; // ... other imports ... /** * Parses the raw JSON string from a tool call's arguments. * Expects specific shapes for known tools like 'shell'. */ export function parseToolCallArguments( rawArguments: string, ): ExecInput | undefined { // ExecInput contains { cmd, workdir, timeoutInMillis } let json: unknown; try { json = JSON.parse(rawArguments); // Basic JSON parsing } catch (err) { // Handle JSON parse errors return undefined; } if (typeof json !== "object" || json == null) return undefined; // Look for 'command' or 'cmd' property, expecting an array of strings const { cmd, command, patch /* other possible args */ } = json as Record<string, unknown>; const commandArray = toStringArray(cmd) ?? toStringArray(command); // If it's a shell command, require the command array if (commandArray != null) { return { cmd: commandArray, // Optional: extract workdir and timeout too workdir: typeof (json as any).workdir === "string" ? (json as any).workdir : undefined, timeoutInMillis: typeof (json as any).timeout === "number" ? (json as any).timeout : undefined, }; } // If it's an apply_patch command, require the patch string if (typeof patch === 'string') { // Return a structure indicating it's a patch, maybe: // return { type: 'patch', patch: patch }; // Or incorporate into ExecInput if unified // For simplicity here, let's assume handleFunctionCall routes based on name, // so we might just return the raw parsed JSON for patch. // But a structured return is better.
Let's adapt ExecInput slightly for demo: return { cmd: ['apply_patch'], patch: patch }; // Use a placeholder cmd } return undefined; // Unknown or invalid arguments structure } // Helper to check if an object is an array of strings function toStringArray(obj: unknown): Array<string> | undefined { if (Array.isArray(obj) && obj.every((item) => typeof item === "string")) { return obj as Array<string>; } return undefined; } /** * Parses a full FunctionCall item for display/review purposes. */ export function parseToolCall( toolCall: ResponseFunctionToolCall, ): CommandReviewDetails | undefined { // CommandReviewDetails has { cmd, cmdReadableText, ... } // Use the argument parser const args = parseToolCallArguments(toolCall.arguments); if (args == null) return undefined; // Format the command nicely for display const cmdReadableText = formatCommandForDisplay(args.cmd); // ... potentially add auto-approval info ... return { cmd: args.cmd, cmdReadableText: cmdReadableText, // ... other details ... }; } ``` * `parseToolCallArguments` takes the raw JSON string (`{"command": ["git", "status"]}`) and uses `JSON.parse`. * It then checks if the parsed object has the expected structure (e.g., a `command` property that is an array of strings for `shell`, or a `patch` string for `apply_patch`). * It returns a structured object (`ExecInput`) containing the validated arguments, or `undefined` if parsing fails. * `parseToolCall` uses `parseToolCallArguments` and then formats the command nicely for display using `formatCommandForDisplay`. ### Handling Patches (`parse-apply-patch.ts`) When the tool is `apply_patch`, the arguments contain a multi-line string describing the changes. Codex has specific logic to parse this format. ```typescript // File: codex-cli/src/utils/agent/parse-apply-patch.ts (Conceptual) // Defines types like ApplyPatchOp (create, delete, update) export function parseApplyPatch(patch: string): Array<ApplyPatchOp> | null { // 1. Check for "*** Begin Patch" and "*** End Patch" markers.
if (!patch.startsWith("*** Begin Patch\n") || !patch.endsWith("\n*** End Patch")) { return null; // Invalid format } // 2. Extract the body between the markers. const patchBody = /* ... extract body ... */; const lines = patchBody.split('\n'); const operations: Array<ApplyPatchOp> = []; for (const line of lines) { // 3. Check for operation markers: if (line.startsWith("*** Add File: ")) { operations.push({ type: "create", path: /* path */, content: "" }); } else if (line.startsWith("*** Delete File: ")) { operations.push({ type: "delete", path: /* path */ }); } else if (line.startsWith("*** Update File: ")) { operations.push({ type: "update", path: /* path */, update: "", added: 0, deleted: 0 }); } else if (operations.length > 0) { // 4. If inside an operation, parse the content/diff lines (+/-) const lastOp = operations[operations.length - 1]; // ... add line content to create/update operation ... } else { // Invalid line outside of an operation return null; } } return operations; // Return the list of parsed operations } ``` This parser specifically understands the `*** Add File:`, `*** Delete File:`, `*** Update File:` markers and the `+`/`-` lines within patches to figure out exactly which files to change and how. ### Displaying Tool Calls (`terminal-chat-response-item.tsx`) The UI needs to show tool calls differently from regular messages. ```tsx // File: codex-cli/src/components/chat/terminal-chat-response-item.tsx (Simplified) import { parseToolCall } from "../../utils/parsers"; // ... other imports: Box, Text from ink ... export default function TerminalChatResponseItem({ item }: { item: ResponseItem }): React.ReactElement { switch (item.type) { case "message": // ... render regular message ... break; case "function_call": // <-- Handle tool calls return <TerminalChatResponseToolCall message={item} />; case "function_call_output": // ... render tool output ... break; // ... other cases ... } // ... fallback ...
} function TerminalChatResponseToolCall({ message }: { message: ResponseFunctionToolCallItem }) { // Use the parser to get displayable details const details = parseToolCall(message); // From parsers.ts if (!details) return <Text color="red">Invalid tool call</Text>; return ( <Box flexDirection="column"> <Text color="magentaBright" bold>command</Text> {/* Display the nicely formatted command */} <Text><Text dimColor>$</Text> {details.cmdReadableText}</Text> </Box> ); } ``` * The main component checks the `item.type`. * If it's `function_call`, it renders a specific component (`TerminalChatResponseToolCall`). * This component uses `parseToolCall` (from `utils/parsers.ts`) to get the details and displays the command in a distinct style (e.g., with a `$` prefix and magenta color). ## Conclusion You've now seen how Codex acts as an interpreter for the AI. It doesn't just receive text; it receives structured instructions. The **Response & Tool Call Handling** system is responsible for parsing these instructions, figuring out if the AI wants to use a tool (like `shell` or `apply_patch`), and extracting the precise arguments needed for that tool. This crucial step translates the AI's intentions into actionable details that Codex can then use to interact with your system, always respecting the rules set by the [Approval Policy & Security](04_approval_policy___security.md). Now that Codex understands *what* command the AI wants to run (e.g., `git status`), how does it actually *execute* that command safely, especially if running in `full-auto` mode? That's the topic of our next chapter.
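Before moving on, here is a minimal end-to-end sketch of the decoding steps from this chapter, run against the sample response shown under Key Concepts. The `OutputItem` and `DecodedToolCall` shapes and the `decode` function are hypothetical simplifications for illustration, not Codex's or the OpenAI SDK's actual types, and the sketch only handles the `command` argument shape.

```typescript
// Hypothetical, trimmed-down item shapes mirroring the sample response in this chapter.
type OutputItem =
  | { type: "message"; role: "assistant"; content: { type: "output_text"; text: string }[] }
  | { type: "function_call"; name: string; arguments: string; call_id: string };

interface DecodedToolCall {
  name: string;
  callId: string;
  command: string[];
}

// Splits a response into displayable text and validated tool calls.
function decode(output: OutputItem[]): { texts: string[]; toolCalls: DecodedToolCall[] } {
  const texts: string[] = [];
  const toolCalls: DecodedToolCall[] = [];
  for (const item of output) {
    if (item.type === "message") {
      texts.push(...item.content.map((c) => c.text));
      continue;
    }
    // function_call: arguments arrive as a JSON string and may be malformed.
    let parsed: unknown;
    try {
      parsed = JSON.parse(item.arguments);
    } catch {
      continue; // skip tool calls whose arguments are not valid JSON
    }
    const command = (parsed as { command?: unknown } | null)?.command;
    // Accept the call only if `command` is an array of strings.
    if (Array.isArray(command) && command.every((c) => typeof c === "string")) {
      toolCalls.push({ name: item.name, callId: item.call_id, command: command as string[] });
    }
  }
  return { texts, toolCalls };
}

const sample: OutputItem[] = [
  {
    type: "message",
    role: "assistant",
    content: [{ type: "output_text", text: "Okay, checking the status..." }],
  },
  {
    type: "function_call",
    name: "shell",
    arguments: "{\"command\": [\"git\", \"status\"]}",
    call_id: "call_abc",
  },
];

const decoded = decode(sample);
```

Validating the parsed arguments before use matters because the `arguments` field is model-generated text: a tool call with malformed JSON or a non-array `command` is dropped here rather than passed on to execution.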
Next up: [Command Execution & Sandboxing](06_command_execution___sandboxing.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Codex/06_command_execution___sandboxing.md
================================================
---
layout: default
title: "Command Execution & Sandboxing"
parent: "Codex"
nav_order: 6
---

# Chapter 6: Command Execution & Sandboxing

In the [previous chapter](05_response___tool_call_handling.md), we learned how Codex listens to the AI and understands when it wants to use a tool, like running a specific shell command (`git status` or `npm install`). We also know from the [Approval Policy & Security](04_approval_policy___security.md) chapter that Codex checks if it *should* run the command based on your chosen safety level.

But once Codex has the command and permission (either from you or automatically), how does it actually *run* that command? And how does it do it safely, especially if you've given it more freedom in `full-auto` mode?

That's the job of the **Command Execution & Sandboxing** system.

## What's the Big Idea? The Workshop Safety Zones

Imagine Codex is working in a workshop. This system is like the different areas and safety procedures in that workshop:

* **The Main Workbench (Raw Execution):** For simple, safe tasks (like running `ls` to list files), Codex might just use the tools directly on the main workbench. It's straightforward, but you wouldn't use dangerous chemicals there.
* **The Safety Cage (Sandboxing):** For potentially risky tasks (like testing a powerful new tool, or maybe running a command the AI suggested that you haven't manually approved in `full-auto` mode), Codex moves the work inside a special safety cage. This cage has reinforced walls and maybe limited power outlets, preventing any accidents from affecting the rest of the workshop.

This system takes a command requested by the AI (like `python script.py` or `git commit -m "AI commit"`) and actually runs it on your computer's command line. Crucially, it decides *whether* to run it directly (on the workbench) or inside a restricted environment (the safety cage or "sandbox"). It also collects the results: what the command printed (output/stdout), any errors (stderr), and whether it finished successfully (exit code).

## Key Concepts

1. **Raw Execution:**
    * **What:** Running the command directly using your system's shell, just like you would type it.
    * **When:** Used for commands deemed safe, or when you explicitly approve a command in `suggest` or `auto-edit` mode.
    * **Pros:** Simple, has full access to your environment (which might be needed).
    * **Cons:** If the AI makes a mistake and suggests a harmful command, running it raw could cause problems.

2. **Sandboxing:**
    * **What:** Running the command inside a restricted environment that limits what it can do. Think of it as putting the command in "jail."
    * **How (Examples):**
        * **macOS Seatbelt:** Uses a built-in macOS feature (`sandbox-exec`) with a specific policy file to strictly control what the command can access (e.g., only allow writing to the project folder, block network access).
        * **Docker Container:** Runs the command inside a lightweight container (like the one defined in `codex-cli/Dockerfile`). This container has only specific tools installed and can have network rules applied (using `iptables`/`ipset` via `init_firewall.sh`) to limit internet access.
    * **When:** Typically used automatically in `full-auto` mode (as decided by the [Approval Policy & Security](04_approval_policy___security.md) check), or potentially if a specific command is flagged as needing extra caution.
    * **Pros:** Significantly reduces the risk of accidental damage from faulty or malicious commands suggested by the AI.
    * **Cons:** Might prevent a command from working if it legitimately needs access to something the sandbox blocks (like a specific system file or network resource). The setup can be more complex.

## How It Works: From Approval to Execution

The Command Execution system doesn't decide *whether* to run a command; that's the job of the [Approval Policy & Security](04_approval_policy___security.md). This system comes into play *after* the approval check.

Remember the `handleExecCommand` function from the [Agent Loop](03_agent_loop.md) chapter? It first calls `canAutoApprove` ([Approval Policy & Security](04_approval_policy___security.md)). If the command is approved (either by policy or by you), `canAutoApprove` tells `handleExecCommand` *whether* sandboxing is needed (`runInSandbox: true` or `runInSandbox: false`).

```typescript
// File: codex-cli/src/utils/agent/handle-exec-command.ts (Simplified Snippet)
import { execCommand } from "./exec-command-helper"; // (Conceptual helper name)
import { getSandbox } from "./sandbox-selector"; // (Conceptual helper name)
// ... other imports: canAutoApprove, config, policy types ...

async function handleExecCommand(
  args: ExecInput, // Contains { cmd: ["git", "status"], ... }
  config: AppConfig,
  policy: ApprovalPolicy,
  getCommandConfirmation: (/*...*/) => Promise<CommandConfirmation>,
  // ... abortSignal ...
): Promise<HandleExecCommandResult> {
  const { cmd: command } = args;

  // 1. Check policy (calls canAutoApprove)
  const safety = canAutoApprove(command, policy, [process.cwd()]);

  let runInSandbox: boolean;
  // 2. Determine if approved and if sandbox needed
  switch (safety.type) {
    case "ask-user":
      // Ask user via getCommandConfirmation...
      // If approved, runInSandbox = false;
      break;
    case "auto-approve":
      runInSandbox = safety.runInSandbox; // Get sandbox flag from policy check
      break;
    // ... handle reject ...
  }

  // 3. *** Execute the command! ***
  // Determine the actual sandbox mechanism (Seatbelt, Docker, None)
  const sandboxType = await getSandbox(runInSandbox);

  // Call the function that handles execution
  const summary = await execCommand(
    args,
    applyPatch, // (if it was an apply_patch command)
    sandboxType,
    abortSignal,
  );

  // 4. Format and return results
  return convertSummaryToResult(summary);
}
```

* **Steps 1 & 2:** The approval policy is checked, and the user may be asked. The outcome is the `runInSandbox` boolean.
* **Step 3:** A helper (`getSandbox`) determines the specific `SandboxType` (e.g., `MACOS_SEATBELT` or `NONE`) based on `runInSandbox` and the operating system. Then, the core execution function (`execCommand`) is called, passing the command details and the chosen `sandboxType`.
* **Step 4:** The results (stdout, stderr, exit code) from `execCommand` are packaged up.

## Under the Hood: Running the Command

Let's trace the execution flow:

```mermaid
sequenceDiagram
    participant HEC as handleExecCommand
    participant EC as execCommand (Helper)
    participant Exec as exec (exec.ts)
    participant Raw as rawExec (raw-exec.ts)
    participant SB as execWithSeatbelt (macos-seatbelt.ts)

    HEC->>EC: Run `git status`, sandboxType=NONE
    EC->>Exec: Calls exec({cmd: ["git", "status"], ...}, SandboxType.NONE)
    Exec->>Exec: Selects rawExec based on sandboxType
    Exec->>Raw: Calls rawExec(["git", "status"], ...)
    Raw->>NodeJS: Uses child_process.spawn("git", ["status"], ...)
    NodeJS-->>Raw: Command finishes (stdout, stderr, code)
    Raw-->>Exec: Returns result
    Exec-->>EC: Returns result
    EC-->>HEC: Returns final summary

    %% Example with Sandbox %%
    HEC->>EC: Run `dangerous_script.sh`, sandboxType=MACOS_SEATBELT
    EC->>Exec: Calls exec({cmd: ["dangerous..."], ...}, SandboxType.MACOS_SEATBELT)
    Exec->>Exec: Selects execWithSeatbelt based on sandboxType
    Exec->>SB: Calls execWithSeatbelt(["dangerous..."], ...)
    SB->>SB: Constructs `sandbox-exec` command with policy
    SB->>Raw: Calls rawExec(["sandbox-exec", "-p", policy, "--", "dangerous..."], ...)
    Raw->>NodeJS: Uses child_process.spawn("sandbox-exec", [...])
    NodeJS-->>Raw: Sandboxed command finishes (stdout, stderr, code)
    Raw-->>SB: Returns result
    SB-->>Exec: Returns result
    Exec-->>EC: Returns result
    EC-->>HEC: Returns final summary
```

### The Entry Point: `exec.ts`

This file acts as a router. It takes the command and the desired `SandboxType` and calls the appropriate execution function.

```typescript
// File: codex-cli/src/utils/agent/exec.ts (Simplified)
import type { SpawnOptions } from "child_process";
import os from "os";
import type { ExecInput, ExecResult } from "./sandbox/interface.js";
import { SandboxType } from "./sandbox/interface.js"; // Used as a value below, so not a type-only import
import { execWithSeatbelt } from "./sandbox/macos-seatbelt.js";
import { exec as rawExec } from "./sandbox/raw-exec.js";
// ... other imports like process_patch for apply_patch ...

// Never rejects, maps errors to non-zero exit code / stderr
export function exec(
  { cmd, workdir, timeoutInMillis }: ExecInput,
  sandbox: SandboxType, // e.g., NONE, MACOS_SEATBELT
  abortSignal?: AbortSignal,
): Promise<ExecResult> {
  // Decide which execution function to use
  const execFunction =
    sandbox === SandboxType.MACOS_SEATBELT ? execWithSeatbelt : rawExec;

  const opts: SpawnOptions = { /* ... set timeout, workdir ... */ };
  const writableRoots = [process.cwd(), os.tmpdir()]; // Basic allowed paths

  // Call the chosen function (either raw or sandboxed)
  return execFunction(cmd, opts, writableRoots, abortSignal);
}

// Special handler for apply_patch pseudo-command
export function execApplyPatch(patchText: string): ExecResult {
  try {
    // Use file system operations directly (fs.writeFileSync etc.)
    const result = process_patch(/* ... patchText, fs functions ... */);
    return { stdout: result, stderr: "", exitCode: 0 };
  } catch (error: unknown) {
    // Handle errors during patching
    return { stdout: "", stderr: String(error), exitCode: 1 };
  }
}
```

* It receives the command (`cmd`), options (`workdir`, `timeout`), and the `sandbox` type.
* It checks the `sandbox` type and chooses either `execWithSeatbelt` (for macOS sandbox) or `rawExec` (for direct execution).
* It calls the selected function.
* Note: `apply_patch` is handled specially by `execApplyPatch`, which directly uses Node.js file system functions instead of spawning a shell command.

### Raw Execution: `raw-exec.ts`

This function runs the command directly using Node.js's built-in `child_process.spawn`.

```typescript
// File: codex-cli/src/utils/agent/sandbox/raw-exec.ts (Simplified)
import type { ExecResult } from "./interface";
import { spawn, type SpawnOptions } from "child_process";
import { log, isLoggingEnabled } from "../log.js";

const MAX_BUFFER = 1024 * 100; // 100 KB limit for stdout/stderr

// Never rejects, maps errors to non-zero exit code / stderr
export function exec(
  command: Array<string>, // e.g., ["git", "status"]
  options: SpawnOptions,
  _writableRoots: Array<string>, // Not used in raw exec
  abortSignal?: AbortSignal,
): Promise<ExecResult> {
  const prog = command[0];
  const args = command.slice(1);

  return new Promise<ExecResult>((resolve) => {
    // Spawn the child process
    const child = spawn(prog, args, {
      ...options,
      stdio: ["ignore", "pipe", "pipe"], // Don't wait for stdin, capture stdout/err
      detached: true, // Allows killing process group on abort
    });

    // Handle abort signal if provided
    if (abortSignal) {
      // Add listener to kill child process if aborted
      // ... abort handling logic ...
    }

    let stdout = "";
    let stderr = "";
    // Capture stdout/stderr, respecting MAX_BUFFER limit
    child.stdout?.on("data", (data) => { /* append to stdout if under limit */ });
    child.stderr?.on("data", (data) => { /* append to stderr if under limit */ });

    // Handle process exit
    child.on("exit", (code, signal) => {
      resolve({ stdout, stderr, exitCode: code ?? 1 });
    });

    // Handle errors like "command not found"
    child.on("error", (err) => {
      resolve({ stdout: "", stderr: String(err), exitCode: 1 });
    });
  });
}
```

* It uses `child_process.spawn` to run the command. `spawn` is generally safer than `exec` as it doesn't involve an intermediate shell unless explicitly requested.
* It captures `stdout` and `stderr` data, enforcing a maximum buffer size to prevent memory issues.
* It listens for the `exit` event to get the exit code.
* It listens for the `error` event (e.g., if the command executable doesn't exist).
* It includes logic to kill the child process if the `abortSignal` is triggered (e.g., user presses Ctrl+C).
* Crucially, it always `resolve`s the promise, even on errors, packaging the error into the `ExecResult`.

### Sandboxing on macOS: `macos-seatbelt.ts`

This function wraps the command execution using macOS's `sandbox-exec` tool.

```typescript
// File: codex-cli/src/utils/agent/sandbox/macos-seatbelt.ts (Simplified)
import type { SpawnOptions } from "child_process";
import type { ExecResult } from "./interface.js";
import { exec as rawExec } from "./raw-exec.js"; // Uses raw exec internally!
import { log } from "../log.js";

const READ_ONLY_POLICY_BASE = `
(version 1)
(deny default)
(allow file-read*)                       ; Allow reading most things
(allow process-exec process-fork signal) ; Allow running/forking
(allow sysctl-read)                      ; Allow reading system info
; ... more base rules ...
`;

// Runs command inside macOS Seatbelt sandbox
export function execWithSeatbelt(
  cmd: Array<string>, // The original command e.g., ["python", "script.py"]
  opts: SpawnOptions,
  writableRoots: Array<string>, // Dirs allowed for writing, e.g., project root
  abortSignal?: AbortSignal,
): Promise<ExecResult> {
  // 1. Build the sandbox policy string
  let policy = READ_ONLY_POLICY_BASE;
  let policyParams: Array<string> = [];

  if (writableRoots.length > 0) {
    // Add rules to allow writing ONLY within specified roots
    const writeRules = writableRoots.map(
      (root, i) => `(allow file-write* (subpath (param "WR_${i}")))`
    ).join("\n");
    policy += `\n${writeRules}`;
    // Create parameters for sandbox-exec
    policyParams = writableRoots.map((root, i) => `-DWR_${i}=${root}`);
  }
  log(`Seatbelt Policy: ${policy}`);

  // 2. Construct the actual command to run: sandbox-exec + policy + original command
  const fullCommand = [
    "sandbox-exec",
    "-p", policy,    // Pass the policy string
    ...policyParams, // Pass parameters like -DWR_0=/path/to/project
    "--",            // End of sandbox-exec options
    ...cmd,          // The original command and arguments
  ];

  // 3. Execute the `sandbox-exec` command using rawExec
  return rawExec(fullCommand, opts, [], abortSignal); // writableRoots not needed by rawExec here
}
```

* It defines a base Seatbelt policy (`.sb` file format) that denies most actions by default but allows basic read operations and process execution.
* It dynamically adds `allow file-write*` rules for the specific `writableRoots` provided (usually the project directory and temp directories).
* It constructs a new command line that starts with `sandbox-exec`, passes the generated policy (`-p`), passes parameters defining the writable roots (`-D`), and finally appends the original command.
* It then calls `rawExec` to run this *entire* `sandbox-exec ... -- original-command ...` line. The operating system handles enforcing the sandbox rules.

### Sandboxing with Docker: `Dockerfile`

Another approach, often used on Linux or as a fallback, is Docker. The `Dockerfile` defines the restricted environment.

```dockerfile
# File: codex-cli/Dockerfile (Simplified Snippets)

# Start from a basic Node.js image
FROM node:20

# Install only necessary tools (git, jq, rg, maybe python/bash, etc.)
# Avoid installing powerful tools unless absolutely needed.
RUN apt update && apt install -y \
    git jq ripgrep sudo iproute2 iptables ipset \
    # ... other minimal tools ...
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Copy codex itself into the container
COPY dist/codex.tgz codex.tgz
RUN npm install -g codex.tgz

# Setup non-root user
USER node
WORKDIR /home/node/workspace # Work happens here

# Copy and set up firewall script (runs via sudo)
# This script uses iptables/ipset to block network access by default,
# potentially allowing only specific domains if configured.
COPY scripts/init_firewall.sh /usr/local/bin/
USER root
RUN chmod +x /usr/local/bin/init_firewall.sh && \
    # Allow 'node' user to run firewall script via sudo without password
    echo "node ALL=(root) NOPASSWD: /usr/local/bin/init_firewall.sh" > /etc/sudoers.d/node-firewall
USER node

# Default command when container starts (might be codex or just a shell)
# ENTRYPOINT ["codex"]
```

* **Minimal Tools:** The Docker image includes only a limited set of command-line tools, reducing the potential attack surface.
* **Non-Root User:** Commands run as a non-privileged user (`node`) inside the container.
* **Workspace:** Work typically happens in a specific directory (e.g., `/home/node/workspace`), often mapped to your project directory on the host machine.
* **Network Firewall:** An `init_firewall.sh` script (run via `sudo` at startup or when needed) configures `iptables` to restrict network access. This prevents sandboxed commands from easily calling out to arbitrary internet addresses.
* **Usage:** Codex might be run *entirely* within this container, or it might invoke commands *inside* this container from the outside using `docker exec`.

## Conclusion

You've reached the end of the workshop tour! The **Command Execution & Sandboxing** system is Codex's way of actually *doing* things on the command line when instructed by the AI.
It carefully considers the safety level decided by the [Approval Policy & Security](04_approval_policy___security.md) and chooses the right execution method: direct "raw" execution for trusted commands, or running inside a protective "sandbox" (like macOS Seatbelt or a Docker container) for potentially riskier operations, especially in `full-auto` mode. This layered approach allows Codex to be powerful while providing crucial safety mechanisms against unintended consequences.

We've seen how Codex handles input, talks to the AI, checks policies, and executes commands. But how does Codex know *which* AI model to use, what your API key is, or which approval mode you prefer? All these settings need to be managed.

Next up: [Configuration Management](07_configuration_management.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Codex/07_configuration_management.md
================================================
---
layout: default
title: "Configuration Management"
parent: "Codex"
nav_order: 7
---

# Chapter 7: Configuration Management

In the [previous chapter](06_command_execution___sandboxing.md), we saw how Codex carefully executes commands, using sandboxing for safety when needed. But how does Codex remember your preferences between sessions? For instance, how does it know which AI model you like to use, or whether you prefer `auto-edit` mode? And how can you give Codex persistent instructions about how you want it to behave?

This is where **Configuration Management** comes in. Think of it like the settings menu or preferences file for Codex.

## What's the Big Idea? Remembering Your Settings

Imagine you prefer using the powerful `gpt-4o` model instead of the default `o4-mini`. Or perhaps you always want Codex to follow a specific coding style or avoid using certain commands unless you explicitly ask.
It would be annoying to tell Codex this *every single time* you run it using command-line flags like `--model gpt-4o`.

Configuration Management solves this by allowing Codex to:

1. **Load Default Settings:** Read a special file to know your preferred model, default [Approval Policy](04_approval_policy___security.md) mode, etc.
2. **Load Custom Instructions:** Read other special files containing your personal guidelines or project-specific rules for the AI.

This way, Codex behaves consistently according to your setup without needing constant reminders. It's like setting up your favorite text editor with your preferred theme and plugins: you do it once, and it remembers.

## Key Concepts

1. **Configuration File (`config.yaml`)**:
    * **Where:** Lives in your home directory, inside a hidden folder: `~/.codex/config.yaml` (it might also be `.json` or `.yml`).
    * **What:** Stores your default settings. The most common setting is the AI `model` you want Codex to use. You can also set things like the default error handling behavior in `full-auto` mode (`fullAutoErrorMode`).
    * **Format:** Usually written in YAML (or JSON), which is a simple, human-readable format.

2. **Instruction Files (`instructions.md`, `codex.md`)**:
    * **Where:**
        * **Global:** `~/.codex/instructions.md` - These instructions apply every time you run Codex, anywhere on your system.
        * **Project-Specific:** `codex.md` (or `.codex.md`) - Placed in the root directory of your code project (or sometimes in subdirectories). These instructions apply only when you run Codex within that specific project.
    * **What:** Contain text instructions (written in Markdown) that guide the AI's behavior. Think of it as giving your AI assistant standing orders.
    * **Format:** Plain Markdown text.

3. **Loading Order:** Codex combines these instructions intelligently:
    * It first reads the global instructions (`~/.codex/instructions.md`).
    * Then, if it finds a project-specific `codex.md` in your current working directory (or its parent Git repository root), it appends those instructions too. Because the two are concatenated, project-specific rules add to, and can refine, your global ones.

## How to Use It: Setting Your Preferences

Let's make Codex always use `gpt-4o` and give it a global instruction.

**1. Set the Default Model:**

Create or edit the file `~/.codex/config.yaml` (you might need to create the `.codex` directory first). Add the following content:

```yaml
# File: ~/.codex/config.yaml

# Use the gpt-4o model by default for all Codex runs
model: gpt-4o

# Optional: How to handle errors when running commands in full-auto
# fullAutoErrorMode: ask-user # (Default) Ask user what to do
# fullAutoErrorMode: ignore-and-continue # Don't stop on error
```

* **Explanation:** This simple YAML file tells Codex that your preferred `model` is `gpt-4o`. Now, you don't need to type `--model gpt-4o` every time!

**2. Add Global Instructions:**

Create or edit the file `~/.codex/instructions.md`. Add some guidelines:

```markdown
# File: ~/.codex/instructions.md

- Always explain your reasoning step-by-step before suggesting code or commands.
- Prefer using Python for scripting tasks unless otherwise specified.
- Use emojis in your responses! 🎉
```

* **Explanation:** This Markdown file gives the AI assistant general rules to follow during *any* conversation.

**3. (Optional) Add Project Instructions:**

Navigate to your project's root directory (e.g., `~/my-cool-project/`) and create a file named `codex.md`:

```markdown
# File: ~/my-cool-project/codex.md

- This project uses TypeScript and adheres to the Prettier style guide.
- When adding new features, always include unit tests using Jest.
- Do not run `git push` directly; always suggest creating a pull request.
```

* **Explanation:** When you run `codex` inside `~/my-cool-project/`, the AI will get *both* the global instructions *and* these project-specific ones.
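
To make the "both global and project" behavior concrete, here is a small sketch of how the two instruction texts could be merged into the single string sent to the AI. It is only an illustration: `combineInstructions` is a hypothetical helper name, and the `--- project-doc ---` separator is taken from the simplified `config.ts` walkthrough later in this chapter rather than verified against the real loader.

```typescript
// Hypothetical helper for illustration; not the actual Codex API.
// Joins global and project instructions, skipping whichever one is empty.
function combineInstructions(globalText: string, projectText: string): string {
  return [globalText, projectText]
    .filter((text) => text.trim().length > 0) // Drop empty or whitespace-only parts
    .join("\n\n--- project-doc ---\n\n"); // Separator assumed from the walkthrough
}

// Example inputs standing in for the contents of the two files.
const globalInstructions = "- Always explain your reasoning step-by-step.";
const projectInstructions = "- This project uses TypeScript and Prettier.";

console.log(combineInstructions(globalInstructions, projectInstructions));
```

When only one of the two files exists, the filter step means the result is just that file's text, with no dangling separator.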

Now, when you run `codex` (without any flags overriding these settings), it will automatically:

* Use the `gpt-4o` model.
* Receive the combined instructions (global + project-specific, if applicable) to guide its responses and actions.

You can disable loading the project `codex.md` file by using the `--no-project-doc` flag if needed.

## Under the Hood: How Codex Loads Configuration

When you start the Codex CLI, one of the first things it does is figure out its configuration.

```mermaid
sequenceDiagram
    participant CLI as Codex CLI Process
    participant ConfigLoader as config.ts (loadConfig)
    participant FileSystem as Your Computer's Files

    CLI->>ConfigLoader: Start: Call loadConfig()
    ConfigLoader->>FileSystem: Check for ~/.codex/config.yaml (or .json, .yml)?
    FileSystem-->>ConfigLoader: Found config.yaml
    ConfigLoader->>FileSystem: Read ~/.codex/config.yaml
    FileSystem-->>ConfigLoader: YAML content (e.g., model: gpt-4o)
    ConfigLoader->>ConfigLoader: Parse YAML, store model='gpt-4o'
    ConfigLoader->>FileSystem: Check for ~/.codex/instructions.md?
    FileSystem-->>ConfigLoader: Found instructions.md
    ConfigLoader->>FileSystem: Read ~/.codex/instructions.md
    FileSystem-->>ConfigLoader: Global instructions text
    ConfigLoader->>FileSystem: Check for project 'codex.md' (discoverProjectDocPath)?
    FileSystem-->>ConfigLoader: Found project/codex.md
    ConfigLoader->>FileSystem: Read project/codex.md
    FileSystem-->>ConfigLoader: Project instructions text
    ConfigLoader->>ConfigLoader: Combine global + project instructions
    ConfigLoader-->>CLI: Return AppConfig object { model, instructions }
    CLI->>CLI: Use AppConfig for AgentLoop, etc.
```

1. **Start:** The main CLI process (`cli.tsx`) starts up.
2. **Load Config:** It calls the `loadConfig` function (from `utils/config.ts`).
3. **Read Settings:** `loadConfig` looks for `~/.codex/config.yaml` (or `.json`/`.yml`). If found, it reads the file, parses the YAML/JSON, and stores the settings (like `model`).
    If not found, it uses defaults (like `o4-mini`).
4. **Read Global Instructions:** It looks for `~/.codex/instructions.md`. If found, it reads the content.
5. **Find Project Instructions:** It calls helper functions like `discoverProjectDocPath` to search the current directory and parent directories (up to the Git root) for a `codex.md` file.
6. **Read Project Instructions:** If `codex.md` is found, it reads the content.
7. **Combine:** `loadConfig` concatenates the global and project instructions (if any) into a single string.
8. **Return:** It returns an `AppConfig` object containing the final model choice, the combined instructions, and other settings.
9. **Use Config:** The CLI process then uses this `AppConfig` object when setting up the [Agent Loop](03_agent_loop.md) and other parts of the application.

## Diving into Code (`config.ts`)

The magic happens mainly in `codex-cli/src/utils/config.ts`.

Here's how the CLI entry point (`cli.tsx`) uses `loadConfig`:

```typescript
// File: codex-cli/src/cli.tsx (Simplified)
import { loadConfig } from "./utils/config";
import App from "./app";
// ... other imports: React, render, meow ...

// --- Get command line arguments ---
const cli = meow(/* ... cli setup ... */);
const prompt = cli.input[0];
const modelOverride = cli.flags.model; // e.g., --model gpt-4

// --- Load Configuration ---
// loadConfig handles reading files and combining instructions
let config = loadConfig(
  undefined, // Use default config file paths
  undefined, // Use default instructions file paths
  {
    cwd: process.cwd(), // Where are we running from? (for project docs)
    disableProjectDoc: Boolean(cli.flags.noProjectDoc), // Did user pass --no-project-doc?
    projectDocPath: cli.flags.projectDoc as string | undefined, // Explicit project doc?
  }
);

// --- Apply Overrides ---
// Command-line flags take precedence over config file settings
config = {
  ...config, // Start with loaded config
  model: modelOverride ?? config.model, // Use flag model if provided, else keep loaded one
  apiKey: process.env["OPENAI_API_KEY"] || "", // Get API key from environment
};

// --- Check Model Support ---
// ... check if config.model is valid ...

// --- Render the App ---
// Pass the final, combined config object to the main UI component
const instance = render(
  <App prompt={prompt} config={config} /* ... other props ... */ />,
);
```

* **Explanation:** The code first calls `loadConfig`, passing options related to finding the project `codex.md`. It then merges these loaded settings with any overrides provided via command-line flags (like `--model`). The final `config` object is passed to the main React `<App>` component.

Inside `config.ts`, the loading logic looks something like this:

```typescript
// File: codex-cli/src/utils/config.ts (Simplified)
import { existsSync, readFileSync } from "fs";
import { load as loadYaml } from "js-yaml";
import { homedir } from "os";
import { join, dirname, resolve as resolvePath } from "path";

export const CONFIG_DIR = join(homedir(), ".codex");
export const CONFIG_YAML_FILEPATH = join(CONFIG_DIR, "config.yaml");
// ... other paths: .json, .yml, instructions.md ...
export const DEFAULT_AGENTIC_MODEL = "o4-mini";

// Represents full runtime config
export type AppConfig = {
  apiKey?: string;
  model: string;
  instructions: string;
  // ... other settings ...
};

// Options for loading
export type LoadConfigOptions = {
  cwd?: string;
  disableProjectDoc?: boolean;
  projectDocPath?: string;
  isFullContext?: boolean; // Affects default model choice
};

export const loadConfig = (
  configPath: string | undefined = CONFIG_YAML_FILEPATH, // Default path
  instructionsPath: string | undefined = join(CONFIG_DIR, "instructions.md"),
  options: LoadConfigOptions = {},
): AppConfig => {
  let storedConfig: Record<string, any> = {}; // Holds data from config.yaml

  // 1. Find and read config.yaml/.json/.yml
  let actualConfigPath = /* ... logic to find existing config file ... */ ;
  if (existsSync(actualConfigPath)) {
    try {
      const raw = readFileSync(actualConfigPath, "utf-8");
      // Parse based on file extension (.yaml, .yml, .json)
      storedConfig = actualConfigPath.endsWith(".json")
        ? JSON.parse(raw)
        : (loadYaml(raw) as Record<string, any>);
    } catch { /* ignore parse errors */ }
  }

  // 2. Read global instructions.md
  const userInstructions = existsSync(instructionsPath)
    ? readFileSync(instructionsPath, "utf-8")
    : "";

  // 3. Read project codex.md (if enabled)
  let projectDoc = "";
  if (!options.disableProjectDoc /* ... and env var check ... */) {
    const cwd = options.cwd ?? process.cwd();
    // loadProjectDoc handles discovery and reading the file
    projectDoc = loadProjectDoc(cwd, options.projectDocPath);
  }

  // 4. Combine instructions
  const combinedInstructions = [userInstructions, projectDoc]
    .filter((s) => s?.trim()) // Remove empty strings
    .join("\n\n--- project-doc ---\n\n"); // Join with separator

  // 5. Determine final model (use stored, else default)
  const model = storedConfig.model?.trim()
    ? storedConfig.model.trim()
    : (options.isFullContext ? /* full ctx default */ : DEFAULT_AGENTIC_MODEL);

  // 6. Assemble the final config object
  const config: AppConfig = {
    model: model,
    instructions: combinedInstructions,
    // ... merge other settings from storedConfig ...
  };

  // ... First-run bootstrap logic to create default files if missing ...

  return config;
};

// Helper to find and read project doc
function loadProjectDoc(cwd: string, explicitPath?: string): string {
  const filepath = explicitPath
    ? resolvePath(cwd, explicitPath)
    : discoverProjectDocPath(cwd); // Search logic

  if (!filepath || !existsSync(filepath)) return "";

  try {
    const buf = readFileSync(filepath);
    // Limit size, return content
    return buf.slice(0, /* MAX_BYTES */).toString("utf-8");
  } catch { return ""; }
}

// Helper to find codex.md by walking up directories
function discoverProjectDocPath(startDir: string): string | null {
  // ... logic to check current dir, then walk up to git root ...
  // ... checks for codex.md, .codex.md etc. ...
  return /* path or null */;
}
```

* **Explanation:** `loadConfig` reads the YAML/JSON config file, reads the global `instructions.md`, uses helpers like `loadProjectDoc` and `discoverProjectDocPath` to find and read the project-specific `codex.md`, combines the instructions, determines the final model name (using defaults if necessary), and returns everything in a structured `AppConfig` object.

## Conclusion

Configuration Management makes Codex much more convenient and personalized. By reading settings from `~/.codex/config.yaml` and instructions from `~/.codex/instructions.md` and project-specific `codex.md` files, it remembers your preferences (like your favorite AI model) and follows your standing orders without you needing to repeat them every time. This allows for a smoother and more consistent interaction tailored to your workflow and project needs.

So far, we've mostly seen Codex working interactively in a chat-like loop. But what if you want Codex to perform a task and exit, perhaps as part of a script?

Next up: [Single-Pass Mode](08_single_pass_mode.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Codex/08_single_pass_mode.md
================================================
---
layout: default
title: "Single-Pass Mode"
parent: "Codex"
nav_order: 8
---

# Chapter 8: Single-Pass Mode

In the [previous chapter](07_configuration_management.md), we explored how Codex uses configuration files to remember your preferences and follow custom instructions. We've mostly seen Codex operate in its default interactive mode, like having a conversation in the [Terminal UI](01_terminal_ui__ink_components_.md) where the [Agent Loop](03_agent_loop.md) goes back and forth with the AI.

But what if you have a task that's very clearly defined? Imagine you want to rename a function across your entire project.
You know exactly what needs to be done, and you don't really need a back-and-forth chat. Wouldn't it be faster if you could just give Codex the instructions and have it figure out *all* the necessary changes at once? That's exactly the idea behind **Single-Pass Mode**. ## What's the Big Idea? The Architect Analogy Think about building a house. The normal, interactive mode of Codex is like having a conversation with your architect room by room: "Let's design the kitchen." "Okay, now how about the living room?" "Should we add a window here?". It's collaborative and allows for adjustments along the way. **Single-Pass Mode** is different. It's like giving the architect the complete blueprints, all the requirements, and the site survey *upfront*, and asking them to come back with the *final, complete building plan* in one go. In this experimental mode, Codex tries to: 1. Gather a large amount of context about your project (lots of code files). 2. Send your request *and* all that context to the AI model *at the same time*. 3. Ask the AI to generate a *complete set* of file operations (creations, updates, deletions) needed to fulfill your request, all in a single response. 4. Show you the proposed changes for review. 5. If you approve, apply all the changes and exit. This mode aims for efficiency, especially on larger, well-defined tasks where you're reasonably confident the AI can generate the full solution without needing clarification. ## Key Concepts 1. **Full Context (Within Limits):** Instead of just looking at one or two files, Codex gathers the content of many files in your project (respecting ignore rules from [Configuration Management](07_configuration_management.md) and size limits like `MAX_CONTEXT_CHARACTER_LIMIT`). This gives the AI a broader view of your codebase. 2. **Single Structured Response:** The AI isn't just asked for text. It's specifically instructed to respond with a structured list of *all* the file operations required. 
Codex uses a predefined schema (like `EditedFilesSchema` defined using Zod in `file_ops.ts`) to tell the AI exactly how to format this list. 3. **All-or-Nothing Confirmation:** You are presented with a summary and a diff (showing additions and deletions) of *all* the proposed changes across all affected files. You then give a single "Yes" or "No" to apply everything or nothing. 4. **Efficiency for Defined Tasks:** This mode shines when your instructions are clear and the task doesn't likely require interactive refinement (e.g., "Rename function X to Y everywhere", "Add logging to every public method in class Z"). ## How to Use It You typically invoke single-pass mode using a specific command-line flag when running Codex (the exact flag might vary, but let's assume `--single-pass`). **Example:** Let's say you want to rename a function `calculate_total` to `compute_grand_total` throughout your project located in `~/my-sales-app/`. ```bash cd ~/my-sales-app/ codex --single-pass "Rename the function 'calculate_total' to 'compute_grand_total' in all project files." ``` **What Happens:** 1. **Context Loading:** Codex will identify the files in `~/my-sales-app/` (respecting ignores), read their content, and note the size. You might see output indicating this. 2. **AI Thinking:** It sends your prompt and the file contents to the AI, asking for the complete set of changes. You'll likely see a spinner. 3. **Review:** Codex receives the proposed file operations from the AI. It calculates the differences (diffs) and shows you a summary: ``` Summary: Modified: src/utils.py (+1/-1) Modified: tests/test_utils.py (+1/-1) Modified: main_app.py (+1/-1) Proposed Diffs: ================================================================================ Changes for: src/utils.py -------------------------------------------------------------------------------- @@ -10,7 +10,7 @@ # ... code ... -def calculate_total(items): +def compute_grand_total(items): # ... implementation ... # ... 
(more diffs for other files) ... Apply these changes? [y/N] ``` 4. **Confirmation:** You type `y` and press Enter. 5. **Applying:** Codex modifies the files `src/utils.py`, `tests/test_utils.py`, and `main_app.py` according to the diffs. 6. **Exit:** The Codex process finishes. If you had typed `n`, no files would have been changed. ## Under the Hood: The Single-Pass Flow Let's trace the journey when you run `codex --single-pass "prompt"`: ```mermaid sequenceDiagram participant User participant CLI as Codex CLI (SinglePass) participant ContextLoader as context_files.ts participant OpenAI participant FileSystem User->>CLI: Runs `codex --single-pass "Rename func..."` CLI->>ContextLoader: Get project file contents (respecting ignores) ContextLoader->>FileSystem: Reads relevant files FileSystem-->>ContextLoader: File contents ContextLoader-->>CLI: Returns list of files & content CLI->>CLI: Formats huge prompt (request + file contents) using `renderTaskContext` CLI->>OpenAI: Sends single large request (expecting structured `EditedFilesSchema` response) Note over CLI, OpenAI: AI processes context and request OpenAI-->>CLI: Returns structured response { ops: [ {path:..., updated_full_content:...}, ... ] } CLI->>CLI: Parses the `ops` list (`file_ops.ts`) CLI->>CLI: Generates diffs and summary (`code_diff.ts`) CLI->>User: Displays summary & diffs, asks "Apply changes? [y/N]" User->>CLI: Types 'y' CLI->>FileSystem: Applies changes (writes updated content, creates/deletes files) CLI->>User: Shows "Changes applied." message CLI->>CLI: Exits ``` 1. **Invocation:** The CLI (`cli_singlepass.tsx`) is started in single-pass mode. 2. **Context Gathering:** It uses functions like `getFileContents` from `utils/singlepass/context_files.ts` to read the content of project files, respecting ignore patterns and size limits. 3. **Prompt Construction:** It builds a large prompt using `renderTaskContext` from `utils/singlepass/context.ts`. 
This prompt includes your request and embeds the content of all gathered files, often in an XML-like format. 4. **AI Call:** It sends this single, massive prompt to the OpenAI API. Crucially, it tells the API to format the response according to a specific structure (`EditedFilesSchema` from `utils/singlepass/file_ops.ts`) which expects a list of file operations. 5. **Response Parsing:** The CLI receives the response and uses the `EditedFilesSchema` to parse the expected list of operations (create file, update file content, delete file, move file). 6. **Diffing & Summary:** It uses helpers like `generateDiffSummary` and `generateEditSummary` from `utils/singlepass/code_diff.ts` to compare the proposed `updated_full_content` for each operation against the original file content, generating human-readable diffs and a summary. 7. **Confirmation:** The main application component (`SinglePassApp` in `components/singlepass-cli-app.tsx`) displays the summary and diffs using Ink components and prompts the user for confirmation (`ConfirmationPrompt`). 8. **Application:** If confirmed, the `applyFileOps` function iterates through the parsed operations and uses Node.js's `fs.promises` module (`fsPromises.writeFile`, `fsPromises.unlink`, etc.) to modify the files on disk. 9. **Exit:** The application cleans up and exits. ## Diving into Code Let's look at the key parts involved. ### Starting Single-Pass Mode (`cli_singlepass.tsx`) This module likely provides the entry point function called by the main CLI when the `--single-pass` flag is detected. 
```typescript
// File: codex-cli/src/cli_singlepass.tsx (Simplified)
import type { AppConfig } from "./utils/config";
import { SinglePassApp } from "./components/singlepass-cli-app";
import { render } from "ink";
import React from "react";

// This function is called by the main CLI logic
export async function runSinglePass({
  originalPrompt, // The user's request string
  config, // Loaded configuration (model, instructions)
  rootPath, // The project directory
}: { /* ... */ }): Promise<void> {
  return new Promise((resolve) => {
    // Render the dedicated Ink UI for single-pass mode
    render(
      <SinglePassApp
        originalPrompt={originalPrompt}
        config={config}
        rootPath={rootPath}
        onExit={() => resolve()} // Callback when the app is done
      />,
    );
  });
}
```

* **Explanation:** This function renders the main React component (`SinglePassApp`), which is responsible for the entire single-pass UI and logic, and passes along the user's prompt and configuration. It wraps the render call in a Promise that resolves when the app calls `onExit`, so the caller can tell when the process is complete.

### The Main UI and Logic (`singlepass-cli-app.tsx`)

This component manages the state (loading, thinking, confirming, etc.) and orchestrates the single-pass flow.

```typescript
// File: codex-cli/src/components/singlepass-cli-app.tsx (Simplified Snippets)
import React, { useEffect, useState } from "react";
import { Box, Text, useApp } from "ink";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
// --- Local Utils ---
import { getFileContents } from "../utils/singlepass/context_files";
import { renderTaskContext } from "../utils/singlepass/context";
import { EditedFilesSchema, FileOperation } from "../utils/singlepass/file_ops";
import { generateDiffSummary, generateEditSummary } from "../utils/singlepass/code_diff";
import * as fsPromises from "fs/promises";
import * as path from "path"; // Needed for path.dirname in applyFileOps
// --- UI Components ---
import { InputPrompt, ConfirmationPrompt } from "./prompts"; // Conceptual grouping

export function SinglePassApp({ /* ...props: config, rootPath, onExit ...
*/ }): JSX.Element { const app = useApp(); const [state, setState] = useState("init"); // 'init', 'prompt', 'thinking', 'confirm', 'applied', 'error'... const [files, setFiles] = useState([]); // Holds { path, content } const [diffInfo, setDiffInfo] = useState({ summary: "", diffs: "", ops: [] }); // 1. Load file context on mount useEffect(() => { (async () => { const fileContents = await getFileContents(rootPath, /* ignorePatterns */); setFiles(fileContents); setState("prompt"); // Ready for user input })(); }, [rootPath]); // 2. Function to run the AI task async function runSinglePassTask(userPrompt: string) { setState("thinking"); try { // Format the context + prompt for the AI const taskContextStr = renderTaskContext({ prompt: userPrompt, files, /*...*/ }); const openai = new OpenAI({ /* ... config ... */ }); // Call OpenAI, specifying the expected structured response format const chatResp = await openai.beta.chat.completions.parse({ model: config.model, messages: [{ role: "user", content: taskContextStr }], response_format: zodResponseFormat(EditedFilesSchema, "schema"), // Ask for this specific structure! }); const edited = chatResp.choices[0]?.message?.parsed; // The parsed { ops: [...] } object if (!edited || !Array.isArray(edited.ops)) { /* Handle no ops */ } // Generate diffs from the AI's proposed operations const [combinedDiffs, opsToApply] = generateDiffSummary(edited, /* original files map */); if (!opsToApply.length) { /* Handle no actual changes */ } const summary = generateEditSummary(opsToApply, /* original files map */); setDiffInfo({ summary, diffs: combinedDiffs, ops: opsToApply }); setState("confirm"); // Move to confirmation state } catch (err) { setState("error"); } } // 3. 
  // Function to apply the changes
  async function applyFileOps(ops: Array<FileOperation>) {
    for (const op of ops) {
      if (op.delete) {
        // Ignore errors such as the file already being gone
        await fsPromises.unlink(op.path).catch(() => {});
      } else {
        // Create or Update
        const newContent = op.updated_full_content || "";
        await fsPromises.mkdir(path.dirname(op.path), { recursive: true });
        await fsPromises.writeFile(op.path, newContent, "utf-8");
      }
      // Handle move_to separately if needed
    }
    setState("applied");
  }

  // --- Render logic based on `state` ---
  if (state === "prompt") {
    return <InputPrompt /* ...props: a submit handler that calls runSinglePassTask... */ />;
  }
  if (state === "thinking") { /* Show Spinner */ }
  if (state === "confirm") {
    return (
      <Box flexDirection="column">
        {/* Display diffInfo.summary and diffInfo.diffs */}
        <ConfirmationPrompt
          /* The callback prop name (onResult) is assumed in this simplified snippet */
          onResult={(accept: boolean) => {
            if (accept) applyFileOps(diffInfo.ops);
            else setState("skipped");
          }}
        />
      </Box>
    );
  }
  if (state === "applied") { /* Show success, maybe offer another prompt */ }
  // ... other states: init, error, skipped ...
  return <Text>...</Text>; // Fallback
}
```

* **Explanation:** This component uses `useEffect` to load files initially. The `runSinglePassTask` function orchestrates calling the AI (using `zodResponseFormat` to enforce the `EditedFilesSchema`) and generating diffs. `applyFileOps` performs the actual file system changes if the user confirms via the `ConfirmationPrompt`. The UI rendered depends on the current `state`.

### Defining the AI's Output: `file_ops.ts`

This file defines the exact structure Codex expects the AI to return in single-pass mode.

```typescript
// File: codex-cli/src/utils/singlepass/file_ops.ts (Simplified)
import { z } from "zod"; // Zod is a schema validation library

// Schema for a single file operation
export const FileOperationSchema = z.object({
  path: z.string().describe("Absolute path to the file."),
  updated_full_content: z.string().optional().describe(
    "FULL CONTENT of the file after modification. MUST provide COMPLETE content."
  ),
  delete: z.boolean().optional().describe("Set true to delete the file."),
  move_to: z.string().optional().describe("New absolute path if file is moved."),
  // Ensure only one action per operation (update, delete, or move)
}).refine(/* ... validation logic ... */);

// Schema for the overall response containing a list of operations
export const EditedFilesSchema = z.object({
  ops: z.array(FileOperationSchema).describe("List of file operations."),
});

export type FileOperation = z.infer<typeof FileOperationSchema>;
export type EditedFiles = z.infer<typeof EditedFilesSchema>;
```

* **Explanation:** This uses the Zod library to define a strict schema. `FileOperationSchema` describes a single change (update, delete, or move), emphasizing that `updated_full_content` must be the *entire* file content. `EditedFilesSchema` wraps this in a list called `ops`. This schema is given to the OpenAI API (via `zodResponseFormat`) so that the AI's response comes back in the expected structure.

### Generating Context and Diffs

* **`context.ts` (`renderTaskContext`):** Takes the user prompt and file contents and formats them into the large string sent to the AI, including instructions and often wrapping each file's path and content in XML-like tags.
* **`code_diff.ts` (`generateDiffSummary`, `generateEditSummary`):** Takes the `ops` returned by the AI and compares the `updated_full_content` with the original content read from disk. It uses a library (like `diff`) to generate standard diff text, formats it (often with colors), and creates a short summary list for display.

## Conclusion

Single-Pass Mode offers a different, potentially faster way to use Codex for well-defined tasks. By providing extensive context upfront and asking the AI for a complete set of structured file operations in one response, it minimizes back-and-forth. You gather context, send one big request, review the complete proposed solution, and either accept or reject it entirely.
While still experimental, it's a powerful approach for streamlining larger refactoring or generation tasks where the requirements are clear.

This concludes our tour through the core concepts of Codex! We've journeyed from the [Terminal UI](01_terminal_ui__ink_components_.md) and [Input Handling](02_input_handling__textbuffer_editor_.md), through the central [Agent Loop](03_agent_loop.md), into the crucial aspects of [Approval Policy & Security](04_approval_policy___security.md), [Response & Tool Call Handling](05_response___tool_call_handling.md), and safe [Command Execution & Sandboxing](06_command_execution___sandboxing.md), learned about [Configuration Management](07_configuration_management.md), and finally explored the alternative [Single-Pass Mode](08_single_pass_mode.md).

We hope this gives you a solid understanding of how Codex works under the hood. Feel free to dive deeper into the codebase, experiment, and perhaps even contribute!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Codex/index.md
================================================
---
layout: default
title: "Codex"
nav_order: 5
has_children: true
---

# Tutorial: Codex

> This tutorial is AI-generated! To learn more, check out [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

Codex ([View Repo](https://github.com/openai/codex)) is a command-line interface (CLI) tool that functions as an **AI coding assistant**. It runs in your terminal, allowing you to chat with an AI model (like *GPT-4o*) to understand, modify, and generate code within your projects. The tool can read files, apply changes (*patches*), and execute shell commands, prioritizing safety through user **approval policies** and command **sandboxing**. It supports both interactive chat and a non-interactive *single-pass mode* for batch operations.
```mermaid flowchart TD A0["Agent Loop"] A1["Terminal UI (Ink Components)"] A2["Approval Policy & Security"] A3["Command Execution & Sandboxing"] A4["Configuration Management"] A5["Response & Tool Call Handling"] A6["Single-Pass Mode"] A7["Input Handling (TextBuffer/Editor)"] A0 -- "Drives updates for" --> A1 A0 -- "Processes responses via" --> A5 A0 -- "Consults policy from" --> A2 A0 -- "Loads config using" --> A4 A1 -- "Uses editor for input" --> A7 A2 -- "Dictates sandboxing for" --> A3 A4 -- "Provides settings to" --> A2 A5 -- "Triggers" --> A3 A7 -- "Provides user input to" --> A0 A0 -- "Can initiate" --> A6 A6 -- "Renders via specific UI" --> A1 ``` ================================================ FILE: docs/Crawl4AI/01_asynccrawlerstrategy.md ================================================ --- layout: default title: "AsyncCrawlerStrategy" parent: "Crawl4AI" nav_order: 1 --- # Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy Welcome to the Crawl4AI tutorial series! Our goal is to build intelligent agents that can understand and extract information from the web. The very first step in this process is actually *getting* the content from a webpage. This chapter explains how Crawl4AI handles that fundamental task. Imagine you need to pick up a package from a specific address. How do you get there and retrieve it? * You could send a **simple, fast drone** that just grabs the package off the porch (if it's easily accessible). This is quick but might fail if the package is inside or requires a signature. * Or, you could send a **full delivery truck with a driver**. The driver can ring the bell, wait, sign for the package, and even handle complex instructions. This is more versatile but takes more time and resources. In Crawl4AI, the `AsyncCrawlerStrategy` is like choosing your delivery vehicle. It defines *how* the crawler fetches the raw content (like the HTML, CSS, and maybe JavaScript results) of a webpage. ## What Exactly is AsyncCrawlerStrategy? 
`AsyncCrawlerStrategy` is a core concept in Crawl4AI that represents the **method** or **technique** used to download the content of a given URL. Think of it as a blueprint: it specifies *that* we need a way to fetch content, but the specific *details* of how it's done can vary. This "blueprint" approach is powerful because it allows us to swap out the fetching mechanism depending on our needs, without changing the rest of our crawling logic. ## The Default: AsyncPlaywrightCrawlerStrategy (The Delivery Truck) By default, Crawl4AI uses `AsyncPlaywrightCrawlerStrategy`. This strategy uses a real, automated web browser engine (like Chrome, Firefox, or WebKit) behind the scenes. **Why use a full browser?** * **Handles JavaScript:** Modern websites rely heavily on JavaScript to load content, change the layout, or fetch data after the initial page load. `AsyncPlaywrightCrawlerStrategy` runs this JavaScript, just like your normal browser does. * **Simulates User Interaction:** It can wait for elements to appear, handle dynamic content, and see the page *after* scripts have run. * **Gets the "Final" View:** It fetches the content as a user would see it in their browser. This is our "delivery truck" – powerful and capable of handling complex websites. However, like a real truck, it's slower and uses more memory and CPU compared to simpler methods. You generally don't need to *do* anything to use it, as it's the default! When you start Crawl4AI, it picks this strategy automatically. ## Another Option: AsyncHTTPCrawlerStrategy (The Delivery Drone) Crawl4AI also offers `AsyncHTTPCrawlerStrategy`. This strategy is much simpler. It directly requests the URL and downloads the *initial* HTML source code that the web server sends back. **Why use this simpler strategy?** * **Speed:** It's significantly faster because it doesn't need to start a browser, render the page, or execute JavaScript. * **Efficiency:** It uses much less memory and CPU. 
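
To make "downloads the initial HTML" concrete, here is a minimal sketch of what a plain HTTP fetch amounts to. It does not use Crawl4AI at all and is not how `AsyncHTTPCrawlerStrategy` is implemented: it assumes a throwaway local server, and `PAGE`, `DemoHandler`, `fetch_initial_html`, and `run_demo` are hypothetical names made up for this illustration. Only Python's standard library is used.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page: the visible text would only be added by JavaScript at runtime.
PAGE = b"""<html><body>
<div id="app"></div>
<script>document.getElementById("app").textContent = "Loaded by JS";</script>
</body></html>"""


class DemoHandler(BaseHTTPRequestHandler):
    """Serves the same static bytes for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # Keep the demo output quiet


def fetch_initial_html(host: str, port: int, path: str = "/") -> str:
    """One plain GET: returns exactly what the server sent, no JS is executed."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()


def run_demo() -> str:
    """Start the local server on a free port, fetch the page once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # Port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        host, port = server.server_address[:2]
        return fetch_initial_html(host, port)
    finally:
        server.shutdown()
        server.server_close()


html = run_demo()
# The placeholder <div> is still empty: nothing ran the script that fills it.
print('<div id="app"></div>' in html)
```

No browser is started and nothing is rendered, which is where the speed and the low resource use come from. The response still contains the `<script>` tag as plain text, but the `<div>` it would have filled arrives empty.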
This is our "delivery drone" – super fast and efficient for simple tasks. **What's the catch?** * **No JavaScript:** It won't run any JavaScript on the page. If content is loaded dynamically by scripts, this strategy will likely miss it. * **Basic HTML Only:** You get the raw HTML source, not necessarily what a user *sees* after the browser processes everything. This strategy is great for websites with simple, static HTML content or when you only need the basic structure and metadata very quickly. ## Why Have Different Strategies? (The Power of Abstraction) Having `AsyncCrawlerStrategy` as a distinct concept offers several advantages: 1. **Flexibility:** You can choose the best tool for the job. Need to crawl complex, dynamic sites? Use the default `AsyncPlaywrightCrawlerStrategy`. Need to quickly fetch basic HTML from thousands of simple pages? Switch to `AsyncHTTPCrawlerStrategy`. 2. **Maintainability:** The logic for *fetching* content is kept separate from the logic for *processing* it. 3. **Extensibility:** Advanced users could even create their *own* custom strategies for specialized fetching needs (though that's beyond this beginner tutorial). ## How It Works Conceptually When you ask Crawl4AI to crawl a URL, the main `AsyncWebCrawler` doesn't fetch the content itself. Instead, it delegates the task to the currently selected `AsyncCrawlerStrategy`. Here's a simplified flow: ```mermaid sequenceDiagram participant C as AsyncWebCrawler participant S as AsyncCrawlerStrategy participant W as Website C->>S: Please crawl("https://example.com") Note over S: I'm using my method (e.g., Browser or HTTP) S->>W: Request Page Content W-->>S: Return Raw Content (HTML, etc.) S-->>C: Here's the result (AsyncCrawlResponse) ``` The `AsyncWebCrawler` only needs to know how to talk to *any* strategy through a common interface (the `crawl` method). The strategy handles the specific details of the fetching process. ## Using the Default Strategy (You're Already Doing It!) 
Let's see how you use the default `AsyncPlaywrightCrawlerStrategy` without even needing to specify it. ```python # main_example.py import asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode async def main(): # When you create AsyncWebCrawler without specifying a strategy, # it automatically uses AsyncPlaywrightCrawlerStrategy! async with AsyncWebCrawler() as crawler: print("Crawler is ready using the default strategy (Playwright).") # Let's crawl a simple page that just returns HTML # We use CacheMode.BYPASS to ensure we fetch it fresh each time for this demo. config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) result = await crawler.arun( url="https://httpbin.org/html", config=config ) if result.success: print("\nSuccessfully fetched content!") # The strategy fetched the raw HTML. # AsyncWebCrawler then processes it (more on that later). print(f"First 100 chars of fetched HTML: {result.html[:100]}...") else: print(f"\nFailed to fetch content: {result.error_message}") if __name__ == "__main__": asyncio.run(main()) ``` **Explanation:** 1. We import `AsyncWebCrawler` and supporting classes. 2. We create an instance of `AsyncWebCrawler()` inside an `async with` block (this handles setup and cleanup). Since we didn't tell it *which* strategy to use, it defaults to `AsyncPlaywrightCrawlerStrategy`. 3. We call `crawler.arun()` to crawl the URL. Under the hood, the `AsyncPlaywrightCrawlerStrategy` starts a browser, navigates to the page, gets the content, and returns it. 4. We print the first part of the fetched HTML from the `result`. ## Explicitly Choosing the HTTP Strategy What if you know the page is simple and want the speed of the "delivery drone"? You can explicitly tell `AsyncWebCrawler` to use `AsyncHTTPCrawlerStrategy`. 
```python # http_strategy_example.py import asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode # Import the specific strategies we want to use from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy async def main(): # 1. Create an instance of the strategy you want http_strategy = AsyncHTTPCrawlerStrategy() # 2. Pass the strategy instance when creating the AsyncWebCrawler async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler: print("Crawler is ready using the explicit HTTP strategy.") # Crawl the same simple page config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) result = await crawler.arun( url="https://httpbin.org/html", config=config ) if result.success: print("\nSuccessfully fetched content using HTTP strategy!") print(f"First 100 chars of fetched HTML: {result.html[:100]}...") else: print(f"\nFailed to fetch content: {result.error_message}") if __name__ == "__main__": asyncio.run(main()) ``` **Explanation:** 1. We now also import `AsyncHTTPCrawlerStrategy`. 2. We create an instance: `http_strategy = AsyncHTTPCrawlerStrategy()`. 3. We pass this instance to the `AsyncWebCrawler` constructor: `AsyncWebCrawler(crawler_strategy=http_strategy)`. 4. The rest of the code is the same, but now `crawler.arun()` will use the faster, simpler HTTP GET request method defined by `AsyncHTTPCrawlerStrategy`. For a simple page like `httpbin.org/html`, both strategies will likely return the same HTML content, but the HTTP strategy would generally be faster and use fewer resources. On a complex JavaScript-heavy site, the HTTP strategy might fail to get the full content, while the Playwright strategy would handle it correctly. ## A Glimpse Under the Hood You don't *need* to know the deep internals to use the strategies, but it helps to understand the structure. Inside the `crawl4ai` library, you'd find a file like `async_crawler_strategy.py`. 
It defines the "blueprint" (an Abstract Base Class): ```python # Simplified from async_crawler_strategy.py from abc import ABC, abstractmethod from .models import AsyncCrawlResponse # Defines the structure of the result class AsyncCrawlerStrategy(ABC): """ Abstract base class for crawler strategies. """ @abstractmethod async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse: """Fetch content from the URL.""" pass # Each specific strategy must implement this ``` And then the specific implementations: ```python # Simplified from async_crawler_strategy.py from playwright.async_api import Page # Playwright library for browser automation # ... other imports class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy): # ... (Initialization code to manage browsers) async def crawl(self, url: str, config: CrawlerRunConfig, **kwargs) -> AsyncCrawlResponse: # Uses Playwright to: # 1. Get a browser page # 2. Navigate to the url (page.goto(url)) # 3. Wait for content, run JS, etc. # 4. Get the final HTML (page.content()) # 5. Optionally take screenshots, etc. # 6. Return an AsyncCrawlResponse # ... implementation details ... pass ``` ```python # Simplified from async_crawler_strategy.py import aiohttp # Library for making HTTP requests asynchronously # ... other imports class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy): # ... (Initialization code to manage HTTP sessions) async def crawl(self, url: str, config: CrawlerRunConfig, **kwargs) -> AsyncCrawlResponse: # Uses aiohttp to: # 1. Make an HTTP GET (or other method) request to the url # 2. Read the response body (HTML) # 3. Get response headers and status code # 4. Return an AsyncCrawlResponse # ... implementation details ... pass ``` The key takeaway is that both strategies implement the same `crawl` method, allowing `AsyncWebCrawler` to use them interchangeably. ## Conclusion You've learned about `AsyncCrawlerStrategy`, the core concept defining *how* Crawl4AI fetches webpage content. 
* It's like choosing a vehicle: a powerful browser (`AsyncPlaywrightCrawlerStrategy`, the default) or a fast, simple HTTP request (`AsyncHTTPCrawlerStrategy`).
* This abstraction gives you flexibility to choose the right fetching method for your task.
* You usually don't need to worry about it, as the default handles most modern websites well.

Now that we understand how the raw content is fetched, the next step is to look at the main class that orchestrates the entire crawling process.

**Next:** Let's dive into the [AsyncWebCrawler](02_asyncwebcrawler.md) itself!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Crawl4AI/02_asyncwebcrawler.md
================================================
---
layout: default
title: "AsyncWebCrawler"
parent: "Crawl4AI"
nav_order: 2
---

# Chapter 2: Meet the General Manager - AsyncWebCrawler

In [Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy](01_asynccrawlerstrategy.md), we learned about the different ways Crawl4AI can fetch the raw content of a webpage, like choosing between a fast drone (`AsyncHTTPCrawlerStrategy`) or a versatile delivery truck (`AsyncPlaywrightCrawlerStrategy`).

But who decides *which* delivery vehicle to use? Who tells it *which* address (URL) to go to? And who takes the delivered package (the raw HTML) and turns it into something useful?

That's where the `AsyncWebCrawler` comes in. Think of it as the **General Manager** of the entire crawling operation.

## What Problem Does `AsyncWebCrawler` Solve?

Imagine you want to get information from a website. You need to:

1. Decide *how* to fetch the page (like choosing the drone or truck from Chapter 1).
2. Actually *fetch* the page content.
3. Maybe *clean up* the messy HTML.
4. Perhaps *extract* specific pieces of information (like product prices or article titles).
5. Maybe *save* the results so you don't have to fetch them again immediately (caching).
6. Finally, give you the *final, processed result*.

Doing all these steps manually for every URL would be tedious and complex. `AsyncWebCrawler` acts as the central coordinator, managing all these steps for you. You just tell it what URL to crawl and maybe some preferences, and it handles the rest.

## What is `AsyncWebCrawler`?

`AsyncWebCrawler` is the main class you'll interact with when using Crawl4AI. It's the primary entry point for starting any crawling task.

**Key Responsibilities:**

* **Initialization:** Sets up the necessary components, like the browser (if needed).
* **Coordination:** Takes your request (a URL and configuration) and orchestrates the different parts:
    * Delegates fetching to an [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
    * Manages caching using [CacheContext / CacheMode](09_cachecontext___cachemode.md).
    * Uses a [ContentScrapingStrategy](04_contentscrapingstrategy.md) to clean and parse HTML.
    * Applies a [RelevantContentFilter](05_relevantcontentfilter.md) if configured.
    * Uses an [ExtractionStrategy](06_extractionstrategy.md) to pull out specific data if needed.
* **Result Packaging:** Bundles everything up into a neat [CrawlResult](07_crawlresult.md) object.
* **Resource Management:** Handles starting and stopping resources (like browsers) cleanly.

It's the "conductor" making sure all the different instruments play together harmoniously.

## Your First Crawl: Using `arun`

Let's see the `AsyncWebCrawler` in action. The most common way to use it is with an `async with` block, which automatically handles setup and cleanup. The main method to crawl a single URL is `arun`.

```python
# chapter2_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler # Import the General Manager

async def main():
    # Create the General Manager instance using 'async with'
    # This handles setup (like starting a browser if needed)
    # and cleanup (closing the browser).
    async with AsyncWebCrawler() as crawler:
        print("Crawler is ready!")

        # Tell the manager to crawl a specific URL
        url_to_crawl = "https://httpbin.org/html" # A simple example page
        print(f"Asking the crawler to fetch: {url_to_crawl}")

        result = await crawler.arun(url=url_to_crawl)

        # Check if the crawl was successful
        if result.success:
            print("\nSuccess! Crawler got the content.")
            # The result object contains the processed data
            # We'll learn more about CrawlResult in Chapter 7
            print(f"Page Title: {result.metadata.get('title', 'N/A')}")
            print(f"First 100 chars of Markdown: {result.markdown.raw_markdown[:100]}...")
        else:
            print(f"\nFailed to crawl: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **`import AsyncWebCrawler`**: We import the main class.
2. **`async def main():`**: Crawl4AI uses Python's `asyncio` for efficiency, so our code needs to be in an `async` function.
3. **`async with AsyncWebCrawler() as crawler:`**: This is the standard way to create and manage the crawler. The `async with` statement ensures that resources (like the underlying browser used by the default `AsyncPlaywrightCrawlerStrategy`) are properly started and stopped, even if errors occur.
4. **`crawler.arun(url=url_to_crawl)`**: This is the core command. We tell our `crawler` instance (the General Manager) to run (`arun`) the crawling process for the specified `url`. `await` is used because fetching webpages takes time, and `asyncio` allows other tasks to run while waiting.
5. **`result`**: The `arun` method returns a `CrawlResult` object. This object contains all the information gathered during the crawl (HTML, cleaned text, metadata, etc.). We'll explore this object in detail in [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
6. **`result.success`**: We check this boolean flag to see if the crawl completed without critical errors.
7. **Accessing Data:** If successful, we can access processed information like the page title (`result.metadata['title']`) or the content formatted as Markdown (`result.markdown.raw_markdown`).

## Configuring the Crawl

Sometimes, the default behavior isn't quite what you need. Maybe you want to use the faster "drone" strategy from Chapter 1, or perhaps you want to ensure you *always* fetch a fresh copy of the page, ignoring any saved cache.

You can customize the behavior of a specific `arun` call by passing a `CrawlerRunConfig` object. Think of this as giving specific instructions to the General Manager for *this particular job*.

```python
# chapter2_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # Import configuration class
from crawl4ai import CacheMode # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        print("Crawler is ready!")
        url_to_crawl = "https://httpbin.org/html"

        # Create a specific configuration for this run
        # Tell the crawler to BYPASS the cache (fetch fresh)
        run_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS
        )
        print("Configuration: Bypass cache for this run.")

        # Pass the config object to the arun method
        result = await crawler.arun(
            url=url_to_crawl,
            config=run_config # Pass the specific instructions
        )

        if result.success:
            print("\nSuccess! Crawler got fresh content (cache bypassed).")
            print(f"Page Title: {result.metadata.get('title', 'N/A')}")
        else:
            print(f"\nFailed to crawl: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **`from crawl4ai import CrawlerRunConfig, CacheMode`**: We import the necessary classes for configuration.
2. **`run_config = CrawlerRunConfig(...)`**: We create an instance of `CrawlerRunConfig`. This object holds various settings for a specific crawl job.
3. **`cache_mode=CacheMode.BYPASS`**: We set the `cache_mode`. `CacheMode.BYPASS` tells the crawler to ignore any previously saved results for this URL and fetch it directly from the web server. We'll learn all about caching options in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md).
4. **`crawler.arun(..., config=run_config)`**: We pass our custom `run_config` object to the `arun` method using the `config` parameter.

The `CrawlerRunConfig` is very powerful and lets you control many aspects of the crawl, including which scraping or extraction methods to use. We'll dive deep into it in the next chapter: [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md).

## What Happens When You Call `arun`? (The Flow)

When you call `crawler.arun(url="...")`, the `AsyncWebCrawler` (our General Manager) springs into action and coordinates several steps behind the scenes:

```mermaid
sequenceDiagram
    participant U as User
    participant AWC as AsyncWebCrawler (Manager)
    participant CC as Cache Check
    participant CS as AsyncCrawlerStrategy (Fetcher)
    participant SP as Scraping/Processing
    participant CR as CrawlResult (Final Report)

    U->>AWC: arun("https://example.com", config)
    AWC->>CC: Need content for "https://example.com"? (Respect CacheMode in config)
    alt Cache Hit & Cache Mode allows reading
        CC-->>AWC: Yes, here's the cached result.
        AWC-->>CR: Package cached result.
        AWC-->>U: Here is the CrawlResult
    else Cache Miss or Cache Mode prevents reading
        CC-->>AWC: No cached result / Cannot read cache.
        AWC->>CS: Please fetch "https://example.com" (using configured strategy)
        CS-->>AWC: Here's the raw response (HTML, etc.)
        AWC->>SP: Process this raw content (Scrape, Filter, Extract based on config)
        SP-->>AWC: Here's the processed data (Markdown, Metadata, etc.)
        AWC->>CC: Cache this result? (Respect CacheMode in config)
        CC-->>AWC: OK, cached.
        AWC-->>CR: Package new result.
        AWC-->>U: Here is the CrawlResult
    end
```

**Simplified Steps:**

1. **Receive Request:** The `AsyncWebCrawler` gets the URL and configuration from your `arun` call.
2. **Check Cache:** It checks if a valid result for this URL is already saved (cached) and if the `CacheMode` allows using it. (See [Chapter 9](09_cachecontext___cachemode.md)).
3. **Fetch (if needed):** If no valid cached result exists or caching is bypassed, it asks the configured [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) (e.g., Playwright or HTTP) to fetch the raw page content.
4. **Process Content:** It takes the raw HTML and passes it through various processing steps based on the configuration:
    * **Scraping:** Cleaning up HTML, extracting basic structure using a [ContentScrapingStrategy](04_contentscrapingstrategy.md).
    * **Filtering:** Optionally filtering content for relevance using a [RelevantContentFilter](05_relevantcontentfilter.md).
    * **Extraction:** Optionally extracting specific structured data using an [ExtractionStrategy](06_extractionstrategy.md).
5. **Cache Result (if needed):** If caching is enabled for writing, it saves the final processed result.
6. **Return Result:** It bundles everything into a [CrawlResult](07_crawlresult.md) object and returns it to you.

## Crawling Many Pages: `arun_many`

What if you have a whole list of URLs to crawl? Calling `arun` in a loop works, but it might not be the most efficient way. `AsyncWebCrawler` provides the `arun_many` method designed for this.
```python
# chapter2_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        urls_to_crawl = [
            "https://httpbin.org/html",
            "https://httpbin.org/links/10/0",
            "https://httpbin.org/robots.txt"
        ]
        print(f"Asking crawler to fetch {len(urls_to_crawl)} URLs.")

        # Use arun_many for multiple URLs
        # We can still pass a config that applies to all URLs in the batch
        config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        results = await crawler.arun_many(urls=urls_to_crawl, config=config)

        print(f"\nFinished crawling! Got {len(results)} results.")
        for result in results:
            status = "Success" if result.success else "Failed"
            url_short = result.url.split('/')[-1] # Get last part of URL
            # Guard against missing metadata (e.g. on a failed crawl)
            title = (result.metadata or {}).get('title', 'N/A')
            print(f"- URL: {url_short:<10} | Status: {status:<7} | Title: {title}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **`urls_to_crawl = [...]`**: We define a list of URLs.
2. **`await crawler.arun_many(urls=urls_to_crawl, config=config)`**: We call `arun_many`, passing the list of URLs. It handles crawling them concurrently (like dispatching multiple delivery trucks or drones efficiently).
3. **`results`**: `arun_many` returns a list where each item is a `CrawlResult` object corresponding to one of the input URLs.

`arun_many` is much more efficient for batch processing as it leverages `asyncio` to handle multiple fetches and processing tasks concurrently. It uses a [BaseDispatcher](10_basedispatcher.md) internally to manage this concurrency.

## Under the Hood (A Peek at the Code)

You don't need to know the internal details to use `AsyncWebCrawler`, but seeing the structure can help. Inside the `crawl4ai` library, the file `async_webcrawler.py` defines this class.

```python
# Simplified from async_webcrawler.py

# ... imports ...
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy
from .async_configs import BrowserConfig, CrawlerRunConfig
from .models import CrawlResult
from .cache_context import CacheContext, CacheMode
# ... other strategy imports ...

class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: AsyncCrawlerStrategy = None, # You can provide a strategy...
        config: BrowserConfig = None, # Configuration for the browser
        # ... other parameters like logger, base_directory ...
    ):
        # If no strategy is given, it defaults to Playwright (the 'truck')
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(...)
        self.browser_config = config or BrowserConfig()
        # ... setup logger, directories, etc. ...
        self.ready = False # Flag to track if setup is complete

    async def __aenter__(self):
        # This is called when you use 'async with'. It starts the strategy.
        await self.crawler_strategy.__aenter__()
        await self.awarmup() # Perform internal setup
        self.ready = True
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        # This is called when exiting 'async with'. It cleans up.
        await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
        self.ready = False

    async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
        # 1. Ensure config exists, set defaults (like CacheMode.ENABLED)
        crawler_config = config or CrawlerRunConfig()
        if crawler_config.cache_mode is None:
            crawler_config.cache_mode = CacheMode.ENABLED

        # 2. Create CacheContext to manage caching logic
        cache_context = CacheContext(url, crawler_config.cache_mode)

        # 3. Try reading from cache if allowed
        cached_result = None
        if cache_context.should_read():
            cached_result = await async_db_manager.aget_cached_url(url)

        # 4. If cache hit and valid, return cached result
        if cached_result and self._is_cache_valid(cached_result, crawler_config):
            # ... log cache hit ...
            return cached_result

        # 5. If no cache hit or cache invalid/bypassed: Fetch fresh content
        #    Delegate to the configured AsyncCrawlerStrategy
        async_response = await self.crawler_strategy.crawl(url, config=crawler_config)

        # 6. Process the HTML (scrape, filter, extract)
        #    This involves calling other strategies based on config
        crawl_result = await self.aprocess_html(
            url=url,
            html=async_response.html,
            config=crawler_config,
            # ... other details from async_response ...
        )

        # 7. Write to cache if allowed
        if cache_context.should_write():
            await async_db_manager.acache_url(crawl_result)

        # 8. Return the final CrawlResult
        return crawl_result

    async def aprocess_html(self, url: str, html: str, config: CrawlerRunConfig, ...) -> CrawlResult:
        # This internal method handles:
        # - Getting the configured ContentScrapingStrategy
        # - Calling its 'scrap' method
        # - Getting the configured MarkdownGenerationStrategy
        # - Calling its 'generate_markdown' method
        # - Getting the configured ExtractionStrategy (if any)
        # - Calling its 'run' method
        # - Packaging everything into a CrawlResult
        # ... implementation details ...
        pass # Simplified

    async def arun_many(self, urls: List[str], config: Optional[CrawlerRunConfig] = None, ...) -> List[CrawlResult]:
        # Uses a Dispatcher (like MemoryAdaptiveDispatcher)
        # to run self.arun for each URL concurrently.
        # ... implementation details using a dispatcher ...
        pass # Simplified

    # ... other methods like awarmup, close, caching helpers ...
```

The key takeaway is that `AsyncWebCrawler` doesn't do the fetching or detailed processing *itself*. It acts as the central hub, coordinating calls to the various specialized `Strategy` classes based on the provided configuration.

## Conclusion

You've met the General Manager: `AsyncWebCrawler`!

* It's the **main entry point** for using Crawl4AI.
* It **coordinates** all the steps: fetching, caching, scraping, extracting.
* You primarily interact with it using `async with` and the `arun()` (single URL) or `arun_many()` (multiple URLs) methods.
* It takes a URL and an optional `CrawlerRunConfig` object to customize the crawl.
* It returns a comprehensive `CrawlResult` object.

Now that you understand the central role of `AsyncWebCrawler`, let's explore how to give it detailed instructions for each crawling job.

**Next:** Let's dive into the specifics of configuration with [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Crawl4AI/03_crawlerrunconfig.md
================================================
---
layout: default
title: "CrawlerRunConfig"
parent: "Crawl4AI"
nav_order: 3
---

# Chapter 3: Giving Instructions - CrawlerRunConfig

In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.

But what if we want to tell the crawler *how* to crawl that URL? Maybe we want it to take a picture (screenshot) of the page? Or perhaps we only care about a specific section of the page? Or maybe we want to ignore the cache and get the very latest version?

Passing all these different instructions individually every time we call `arun` could get complicated and messy.

```python
# Imagine doing this every time - it gets long!
# result = await crawler.arun(
#     url="https://example.com",
#     take_screenshot=True,
#     ignore_cache=True,
#     only_look_at_this_part="#main-content",
#     wait_for_this_element="#data-table",
#     # ... maybe many more settings ...
# )
```

That's where `CrawlerRunConfig` comes in!

## What Problem Does `CrawlerRunConfig` Solve?

Think of `CrawlerRunConfig` as the **Instruction Manual** for a *specific* crawl job.
Instead of giving the `AsyncWebCrawler` manager lots of separate instructions each time, you bundle them all neatly into a single `CrawlerRunConfig` object.

This object tells the `AsyncWebCrawler` exactly *how* to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.

## What is `CrawlerRunConfig`?

`CrawlerRunConfig` is a configuration class that holds all the settings for a single crawl operation initiated by `AsyncWebCrawler.arun()` or `arun_many()`.

It allows you to customize various aspects of the crawl, such as:

* **Taking Screenshots:** Should the crawler capture an image of the page? (`screenshot`)
* **Waiting:** How long should the crawler wait for the page or specific elements to load? (`page_timeout`, `wait_for`)
* **Focusing Content:** Should the crawler only process a specific part of the page? (`css_selector`)
* **Extracting Data:** Should the crawler use a specific method to pull out structured data? ([ExtractionStrategy](06_extractionstrategy.md))
* **Caching:** How should the crawler interact with previously saved results? ([CacheMode](09_cachecontext___cachemode.md))
* **And much more!** (like handling JavaScript, filtering links, etc.)

## Using `CrawlerRunConfig`

Let's see how to use it. Remember our basic crawl from Chapter 2?

```python
# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with default settings...")

        # This uses the default behavior (no specific config)
        result = await crawler.arun(url=url_to_crawl)

        if result.success:
            print("Success! Got the content.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Likely No
            # We'll learn about CacheMode later, but it defaults to using the cache
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

Now, let's say for this *specific* crawl, we want to bypass the cache (fetch fresh) and also take a screenshot. We create a `CrawlerRunConfig` instance and pass it to `arun`:

```python
# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # 1. Import the config class
from crawl4ai import CacheMode # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with custom settings...")

        # 2. Create an instance of CrawlerRunConfig with our desired settings
        my_instructions = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS, # Don't use the cache, fetch fresh
            screenshot=True # Take a screenshot
        )
        print("Instructions: Bypass cache, take screenshot.")

        # 3. Pass the config object to arun()
        result = await crawler.arun(
            url=url_to_crawl,
            config=my_instructions # Pass our instruction manual
        )

        if result.success:
            print("\nSuccess! Got the content with custom config.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Should be Yes
            # Check if the screenshot file path exists in result.screenshot
            if result.screenshot:
                print(f"Screenshot saved to: {result.screenshot}")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **Import:** We import `CrawlerRunConfig` and `CacheMode`.
2. **Create Config:** We create an instance: `my_instructions = CrawlerRunConfig(...)`. We set `cache_mode` to `CacheMode.BYPASS` and `screenshot` to `True`. All other settings remain at their defaults.
3. **Pass Config:** We pass this `my_instructions` object to `crawler.arun` using the `config=` parameter.
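
The idea that you override only a couple of settings while everything else stays at its default can be sketched without installing Crawl4AI. The `RunConfigSketch` dataclass and the `overridden_fields` helper below are hypothetical stand-ins invented for illustration; they are not the real `CrawlerRunConfig`, and the default values shown are assumptions, not the library's guaranteed defaults.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class RunConfigSketch:
    """Hypothetical stand-in for CrawlerRunConfig (NOT the real crawl4ai class)."""
    cache_mode: str = "enabled"         # illustrative default only
    screenshot: bool = False
    css_selector: Optional[str] = None
    page_timeout: int = 60000           # milliseconds, illustrative default


def overridden_fields(config: RunConfigSketch) -> dict:
    """Return only the settings that differ from the defaults."""
    defaults = RunConfigSketch()
    return {
        f.name: getattr(config, f.name)
        for f in fields(config)
        if getattr(config, f.name) != getattr(defaults, f.name)
    }


# Two overrides; every other setting keeps its default value.
my_instructions = RunConfigSketch(cache_mode="bypass", screenshot=True)
# A separate, untouched config for some other run.
default_instructions = RunConfigSketch()

print(overridden_fields(my_instructions))       # {'cache_mode': 'bypass', 'screenshot': True}
print(overridden_fields(default_instructions))  # {}
print(my_instructions.page_timeout)             # 60000
```

Because each config is its own object, the instructions built for one run do not leak into another run that uses a different (or default) config.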

Now, when `AsyncWebCrawler` runs this job, it will look inside `my_instructions` and follow those specific settings for *this run only*.

## Some Common `CrawlerRunConfig` Parameters

`CrawlerRunConfig` has many options, but here are a few common ones you might use:

* **`cache_mode`**: Controls caching behavior.
    * `CacheMode.ENABLED` (Default): Use the cache if available, otherwise fetch and save.
    * `CacheMode.BYPASS`: Always fetch fresh, ignoring any cached version (but still save the new result).
    * `CacheMode.DISABLED`: Never read from or write to the cache.
    * *(More details in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md))*
* **`screenshot` (bool)**: If `True`, takes a screenshot of the fully rendered page. The path to the screenshot file will be in `CrawlResult.screenshot`. Default: `False`.
* **`pdf` (bool)**: If `True`, generates a PDF of the page. The path to the PDF file will be in `CrawlResult.pdf`. Default: `False`.
* **`css_selector` (str)**: If provided (e.g., `"#main-content"` or `".article-body"`), the crawler will try to extract *only* the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: `None` (process the whole page).
* **`wait_for` (str)**: A CSS selector (e.g., `"#data-loaded-indicator"`). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: `None`.
* **`page_timeout` (int)**: Maximum time in milliseconds to wait for page navigation or certain operations. Default: `60000` (60 seconds).
* **`extraction_strategy`**: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: `None`. *(See [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md))*
* **`scraping_strategy`**: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: `WebScrapingStrategy()`. *(See [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md))*

Let's try combining a few: focus on a specific part of the page and wait for something to appear.

```python
# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # This example site has a heading 'H1' inside a 'body' tag.
    url_to_crawl = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")

        # Instructions: Only get the H1 tag, wait max 10s for it
        specific_config = CrawlerRunConfig(
            css_selector="h1", # Only grab content inside <h1> tags
            page_timeout=10000 # Set page timeout to 10 seconds
            # We could also add wait_for="h1" if needed for dynamic loading
        )

        result = await crawler.arun(url=url_to_crawl, config=specific_config)

        if result.success:
            print("\nSuccess! Focused crawl completed.")
            # The markdown should now ONLY contain the H1 content
            print(f"Markdown content:\n---\n{result.markdown.raw_markdown.strip()}\n---")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

This time, the `result.markdown` should only contain the text from the `<h1>` tag on that page, because we used `css_selector="h1"` in our `CrawlerRunConfig`.

## How `AsyncWebCrawler` Uses the Config (Under the Hood)

You don't need to know the exact internal code, but it helps to understand the flow. When you call `crawler.arun(url, config=my_config)`, the `AsyncWebCrawler` essentially does this:

1. Receives the `url` and the `my_config` object.
2. Before fetching, it checks `my_config.cache_mode` to see if it should look in the cache first.
3. If fetching is needed, it passes `my_config` to the underlying [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
4. The strategy uses settings from `my_config` like `page_timeout`, `wait_for`, and whether to take a `screenshot`.
5. After getting the raw HTML, `AsyncWebCrawler` uses the `my_config.scraping_strategy` and `my_config.css_selector` to process the content.
6. If `my_config.extraction_strategy` is set, it uses that to extract structured data.
7. Finally, it bundles everything into a `CrawlResult` and returns it.

Here's a simplified view:

```mermaid
sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Config as CrawlerRunConfig
    participant Fetcher as AsyncCrawlerStrategy
    participant Processor as Scraping/Extraction

    User->>AWC: arun(url, config=my_config)
    AWC->>Config: Check my_config.cache_mode
    alt Need to Fetch
        AWC->>Fetcher: crawl(url, config=my_config)
        Note over Fetcher: Uses my_config settings (timeout, wait_for, screenshot...)
        Fetcher-->>AWC: Raw Response (HTML, screenshot?)
        AWC->>Processor: Process HTML (using my_config.css_selector, my_config.extraction_strategy...)
        Processor-->>AWC: Processed Data
    else Use Cache
        AWC->>AWC: Retrieve from Cache
    end
    AWC-->>User: Return CrawlResult
```

The `CrawlerRunConfig` acts as a messenger carrying your specific instructions throughout the crawling process.

Inside the `crawl4ai` library, in the file `async_configs.py`, you'll find the definition of the `CrawlerRunConfig` class.
It looks something like this (simplified):

```python
# Simplified from crawl4ai/async_configs.py

from .cache_context import CacheMode
from .extraction_strategy import ExtractionStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
# ... other imports ...

class CrawlerRunConfig():
    """
    Configuration class for controlling how the crawler runs each crawl operation.
    """
    def __init__(
        self,
        # Caching
        cache_mode: CacheMode = CacheMode.BYPASS, # Default behavior if not specified

        # Content Selection / Waiting
        css_selector: str = None,
        wait_for: str = None,
        page_timeout: int = 60000, # 60 seconds

        # Media
        screenshot: bool = False,
        pdf: bool = False,

        # Processing Strategies
        scraping_strategy: ContentScrapingStrategy = None, # Defaults internally if None
        extraction_strategy: ExtractionStrategy = None,

        # ... many other parameters omitted for clarity ...
        **kwargs # Allows for flexibility
    ):
        self.cache_mode = cache_mode
        self.css_selector = css_selector
        self.wait_for = wait_for
        self.page_timeout = page_timeout
        self.screenshot = screenshot
        self.pdf = pdf

        # Assign scraping strategy, ensuring a default if None is provided
        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()

        self.extraction_strategy = extraction_strategy
        # ... initialize other attributes ...

    # Helper methods like 'clone', 'to_dict', 'from_kwargs' might exist too
    # ...
```

The key idea is that it's a class designed to hold various settings together. When you create an instance `CrawlerRunConfig(...)`, you're essentially creating an object that stores your choices for these parameters.

## Conclusion

You've learned about `CrawlerRunConfig`, the "Instruction Manual" for individual crawl jobs in Crawl4AI!

* It solves the problem of passing many settings individually to `AsyncWebCrawler`.
* You create an instance of `CrawlerRunConfig` and set the parameters you want to customize (like `cache_mode`, `screenshot`, `css_selector`, `wait_for`).
* You pass this config object to `crawler.arun(url, config=your_config)`.
* This makes your code cleaner and gives you fine-grained control over *how* each crawl is performed.

Now that we know how to fetch content ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)), manage the overall process ([AsyncWebCrawler](02_asyncwebcrawler.md)), and give specific instructions ([CrawlerRunConfig](03_crawlerrunconfig.md)), let's look at how the raw, messy HTML fetched from the web is initially cleaned up and processed.

**Next:** Let's explore [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

================================================
FILE: docs/Crawl4AI/04_contentscrapingstrategy.md
================================================
---
layout: default
title: "ContentScrapingStrategy"
parent: "Crawl4AI"
nav_order: 4
---

# Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy

In [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md), we learned how to give specific instructions to our `AsyncWebCrawler` using `CrawlerRunConfig`. This included telling it *how* to fetch the page and potentially take screenshots or PDFs.

Now, imagine the crawler has successfully fetched the raw HTML content of a webpage. What's next? Raw HTML is often messy! It contains not just the main article or product description you might care about, but also:

* Navigation menus
* Advertisements
* Headers and footers
* Hidden code like JavaScript (`