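# Building a Controllable Multi-Agent Research Assistant in Python

- Original article: How to Build a Multi-Agent Research Assistant in Python
- Original URL: https://machinelearningmastery.com/how-to-build-a-multi-agent-research-assistant-in-python/
- Source summary path: /home/lin/.hermes/projects/hermes-gsummary-workflow/runs/outputs/20260522-121938-How-to-Build-a-Multi-Agent-Research-Assistant-in-Python-1837182-650363080-summary.md
- Audience: Python developers who want to turn "search, scrape, evaluate, fill evidence gaps, generate a report" into an automated research workflow
- Note: The original article uses the OpenAI Agents SDK, Olostep, Pydantic, and dotenv. This tutorial distills its approach into reproducible steps and adds a minimal version that runs offline, a skeleton for wiring in the real APIs, a quality gate, and failure handling.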

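## 1. You are not building "a chatbot that can search"

The goal is a research assistant with a closed loop:

```text
User question
  ↓
Manager Agent: decides whether to answer quickly or keep retrieving
  ↓
Search / Scrape tools: collect candidate evidence
  ↓
Judge Agent: decides whether the evidence is sufficient; returns a score and the gaps
  ↓ if insufficient, gather more evidence
Analyst Agent: generates the structured research report
```

Core principles:

- The Manager only orchestrates; it does not produce the final conclusion from memory.
- The Judge must be independent of the Analyst and is responsible for blocking low-quality evidence.
- The Analyst writes the report only after the evidence meets the threshold.
- External search and scraping must have a budget, a timeout, and handling for empty results.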

## 2. Minimal directory structure

Start by creating a project directory:

```bash
mkdir -p multi_agent_research_assistant
cd multi_agent_research_assistant
python3 -m venv .venv
source .venv/bin/activate
```

Suggested layout:

```text
multi_agent_research_assistant/
├── .env.example
├── README.md
├── app.py                  # minimal closed loop that runs offline
├── real_tools.py           # skeleton for the real OpenAI / Olostep integration
├── requirements.txt
└── samples/
    └── evidence.json
```

Write the dependency file first:

```bash
cat > requirements.txt <<'EOF'
openai-agents
olostep
python-dotenv
pydantic
EOF
```

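If you only run the offline minimal loop from this tutorial, you do not need to install these dependencies yet; it uses only the Python standard library.

## 3. Example 1: run an offline loop first to understand the Manager / Judge / Analyst split

This version calls no external APIs and simulates search results with fixed samples. Its value is in validating the flow: the Manager collects evidence, the Judge scores it, the loop gathers more evidence if the score falls short, and the Analyst writes the report only after the threshold is met.

Create `app.py`: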

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Evidence:
    title: str
    url: str
    content: str
    source_type: str = "web"


@dataclass
class Judgment:
    is_good_enough: bool
    score: float
    reason: str
    missing_information: list[str] = field(default_factory=list)


class SearchTool(Protocol):
    def search(self, query: str, limit: int = 3) -> list[Evidence]: ...


class MockSearchTool:
    def search(self, query: str, limit: int = 3) -> list[Evidence]:
        records = [
            Evidence(
                title="AI agents in business research",
                url="https://example.com/agents-business-research",
                content=(
                    "AI agents can search sources, compare evidence, and draft reports. "
                    "Production systems need source tracking and quality checks."
                ),
            ),
            Evidence(
                title="Agent evaluation patterns",
                url="https://example.com/agent-evaluation",
                content=(
                    "A judge component can score evidence sufficiency. "
                    "Thresholds such as 0.85 help decide when to stop searching."
                ),
            ),
            Evidence(
                title="Failure modes of web research agents",
                url="https://example.com/research-agent-failures",
                content=(
                    "Search APIs may return empty pages, stale snippets, or duplicated sources. "
                    "Systems should enforce retry limits and budget caps."
                ),
            ),
        ]
        return records[:limit]


class JudgeAgent:
    def judge(self, question: str, evidence: list[Evidence]) -> Judgment:
        # No evidence at all: fail immediately with a score of 0.
        if not evidence:
            return Judgment(False, 0.0, "No evidence", ["At least 2 sources are required"])

        unique_urls = {item.url for item in evidence}
        texts = [item.content.lower() for item in evidence]
        # Keyword heuristics stand in for a real LLM judgment in this offline demo.
        has_quality_check = any("score" in text or "quality" in text for text in texts)
        has_failure_mode = any("empty" in text or "failure" in text for text in texts)

        score = 0.35
        score += min(len(unique_urls), 3) * 0.15
        if has_quality_check:
            score += 0.20
        if has_failure_mode:
            score += 0.15
        # Round to 2 decimals so float error cannot push a borderline 0.85 below the threshold.
        score = round(min(score, 1.0), 2)

        missing: list[str] = []
        if len(unique_urls) < 2:
            missing.append("At least 2 independent sources are required")
        if not has_quality_check:
            missing.append("Missing a quality evaluation mechanism")
        if not has_failure_mode:
            missing.append("Missing failure modes and a fallback strategy")

        return Judgment(
            is_good_enough=score >= 0.85,
            score=score,
            reason=f"Current evidence score {score:.2f}",
            missing_information=missing,
        )


class AnalystAgent:
    def write_report(self, question: str, evidence: list[Evidence], judgment: Judgment) -> str:
        sources = "\n".join(f"- {item.title}: {item.url}" for item in evidence)
        findings = "\n".join(f"- {item.content}" for item in evidence)
        return f"""# Research Report

## Executive Summary

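Question: {question}

Current evidence score: {judgment.score:.2f}. Conclusion: a multi-agent research assistant should use a closed loop in which the Manager orchestrates, the Judge evaluates, and the Analyst writes the report.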

## Key Findings

{findings}

## Source Notes

{sources}
"""


class ManagerAgent:
    def __init__(self, search_tool: SearchTool, judge: JudgeAgent, analyst: AnalystAgent) -> None:
        self.search_tool = search_tool
        self.judge = judge
        self.analyst = analyst

    def run(self, question: str, max_rounds: int = 2) -> str:
        all_evidence: list[Evidence] = []
        last_judgment = Judgment(False, 0.0, "Not evaluated yet", [])

        for round_index in range(1, max_rounds + 1):
            # From round 2 on, steer the query toward the gaps the Judge reported.
            query = question if round_index == 1 else question + " " + " ".join(last_judgment.missing_information)
            new_evidence = self.search_tool.search(query=query, limit=3)
            all_evidence.extend(new_evidence)

            last_judgment = self.judge.judge(question, all_evidence)
            print(f"round={round_index} score={last_judgment.score:.2f} good={last_judgment.is_good_enough}")

            if last_judgment.is_good_enough:
                break

        # Quality gate: without a passing judgment, the Analyst is never called.
        if not last_judgment.is_good_enough:
            return (
                "Evidence is still insufficient; the final report was not generated.\n"
                f"Reason: {last_judgment.reason}\n"
                f"Gaps: {', '.join(last_judgment.missing_information)}"
            )

        return self.analyst.write_report(question, all_evidence, last_judgment)


def main() -> None:
    manager = ManagerAgent(
        search_tool=MockSearchTool(),
        judge=JudgeAgent(),
        analyst=AnalystAgent(),
    )
    report = manager.run("How should a business research assistant use AI agents?")
    print("\n" + report)


if __name__ == "__main__":
    main()
```

Run it:

```bash
python app.py
```

Expected output shape:

```text
round=1 score=1.00 good=True

# Research Report

## Executive Summary
...
## Key Findings
...
## Source Notes
...
```

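If you change `MockSearchTool` to return only 1 piece of evidence, the score stays below 0.85 in every round and the run should end with the insufficient-evidence message rather than a report. That shows the Judge gate is working.

## 4. Example 2: replace search/scrape with the real Olostep tool

The real version has to handle API keys. Do not hard-code keys in the source; use `.env`: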

```bash
cat > .env.example <<'EOF'
OPENAI_API_KEY=your_openai_api_key
OLOSTEP_API_KEY=your_olostep_api_key
EOF
cp .env.example .env
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Create `real_tools.py`:

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Any

from dotenv import load_dotenv
from olostep import Olostep

load_dotenv()


@dataclass
class Evidence:
    title: str
    url: str
    content: str
    source_type: str = "web"


class ToolError(RuntimeError):
    pass


def require_env(name: str) -> str:
    value = os.getenv(name)
    if not value:
        raise ToolError(f"Missing environment variable: {name}")
    return value


def safe_text(value: Any, max_chars: int = 8000) -> str:
    text = "" if value is None else str(value)
    text = text.strip()
    if len(text) > max_chars:
        return text[:max_chars] + "\n... [truncated]"
    return text


class OlostepSearchTool:
    def __init__(self) -> None:
        self.client = Olostep(api_key=require_env("OLOSTEP_API_KEY"))

    def search(self, query: str, limit: int = 5) -> list[Evidence]:
        try:
            result = self.client.searches.create(
                query=query,
                limit=limit,
                scrape_options={"formats": ["markdown"], "timeout": 25},
            )
        except Exception as exc:
            raise ToolError(f"Olostep search failed: {exc}") from exc

        evidence: list[Evidence] = []
        for link in getattr(result, "links", []) or []:
            url = link.get("url", "")
            title = link.get("title") or url
            markdown = link.get("markdown_content") or link.get("description") or ""
            if not url or len(markdown.strip()) < 200:
                continue
            evidence.append(
                Evidence(
                    title=title,
                    url=url,
                    content=safe_text(markdown),
                )
            )
        return evidence
```

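This code does one thing: it converts "search plus scraped Markdown" into the internal `Evidence` type, so the Manager / Judge / Analyst never need to know the shape of the Olostep SDK's response.

## 5. Example 3: design the Judge as a real quality gate

The key point of the original article is not "it can call a search API" but the Judge's score and its gap feedback. A good practice is to fix the Judge's output to structured fields: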

```python
from dataclasses import dataclass, field


@dataclass
class Judgment:
    is_good_enough: bool
    score: float
    reason: str
    missing_information: list[str] = field(default_factory=list)
```

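Suggested scoring bands:

- `0.85–1.00`: evidence is sufficient; the final report can be generated.
- `0.75–0.84`: mostly usable, but missing one key source, date, or counterexample.
- `0.50–0.74`: only partial evidence; keep searching.
- `0.25–0.49`: evidence is weak, stale, or poorly relevant.
- `<0.25`: empty data or completely irrelevant.

Most importantly, the Judge should not only give a score but also output `missing_information`, so the Manager knows where to aim the next round of evidence gathering.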
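
The bands above can be turned into a small lookup so the Manager and the run logs share one vocabulary. This is a minimal sketch: the function name `score_band` and the band labels are illustrative assumptions, not names from the original article.

```python
def score_band(score: float) -> str:
    """Map a Judge score in [0, 1] to one of the five bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= 0.85:
        return "sufficient"           # final report can be generated
    if score >= 0.75:
        return "usable_with_gap"      # one key source, date, or counterexample missing
    if score >= 0.50:
        return "partial"              # keep searching
    if score >= 0.25:
        return "weak"                 # weak, stale, or poorly relevant
    return "empty_or_irrelevant"


if __name__ == "__main__":
    for value in (0.9, 0.8, 0.6, 0.3, 0.1):
        print(value, score_band(value))
```

Only the top band should unlock the Analyst; every other band is a reason to keep gathering evidence or to stop.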

Example prompt fragment:

```text
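You are a research quality evaluator. Decide whether the current evidence is sufficient to answer the user's question.

Scoring criteria:
- 0.85-1.00: evidence is sufficient, sources are credible, no critical gaps
- 0.75-0.84: strong information, but missing one important source, detail, or freshness check
- 0.50-0.74: only partial evidence; more searching is needed
- 0.25-0.49: thin, stale, or weakly relevant data
- <0.25: empty data or completely irrelevant

Output JSON:
{
  "is_good_enough": true/false,
  "score": 0.0-1.0,
  "reason": "short explanation",
  "missing_information": ["information that still needs to be added"]
}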
```
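
An LLM does not always return the JSON it was asked for, so it may help to parse the reply defensively before the Manager acts on it. The sketch below assumes the reply arrives as a raw string; `parse_judgment` is a hypothetical helper, and failing closed (treating malformed output as "not good enough") is a design choice of this sketch rather than something the original article prescribes.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class Judgment:
    is_good_enough: bool
    score: float
    reason: str
    missing_information: list[str] = field(default_factory=list)


def parse_judgment(raw: str, threshold: float = 0.85) -> Judgment:
    """Parse the Judge's JSON reply; anything malformed counts as a failed check."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
        reason = str(data["reason"])
        claimed = data["is_good_enough"]
        missing = [str(item) for item in data.get("missing_information", [])]
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        return Judgment(False, 0.0, f"Unparseable Judge output: {exc}", ["Re-run the Judge"])

    # Clamp the score into [0, 1] in case the model drifts outside the range.
    score = max(0.0, min(score, 1.0))
    # Trust the flag only when the score agrees with it.
    good = claimed is True and score >= threshold
    return Judgment(good, score, reason, missing)


if __name__ == "__main__":
    reply = '{"is_good_enough": true, "score": 0.6, "reason": "partial", "missing_information": ["dates"]}'
    print(parse_judgment(reply))
```

Requiring both the flag and the score to agree means a reply that claims `is_good_enough: true` with a score of 0.6 still fails the gate.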

## 6. Report output template

To keep the Analyst from improvising the structure, fix the sections of the final report:

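```text
# Research Report

## Executive Summary
Answer the question in one sentence and state the strength of the evidence.

## Key Findings
List the core findings as bullets, each tied to a source.

## Context
Explain the background and the boundaries of the question.

## Evidence Review
Describe the evidence sources, their dates, credibility, and points of conflict.

## Detailed Analysis
Expand the analysis without presenting inferences as facts.

## Implications
Explain the impact on business, technology, or decisions.

## Source Notes
List the source URLs and the limitations of the scrape.

## References
List the cited links.
```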

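Note: the original article stresses that the report must not grow extra sections at will. Your production version should likewise write the allowed sections into the Analyst's system instructions.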
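
Instructions alone do not guarantee the structure, so a cheap post-check on the generated Markdown can catch drift. The following sketch assumes the report uses `##` headings exactly as in the template; `ALLOWED_SECTIONS` and `check_report_sections` are illustrative names, not part of the original article.

```python
from __future__ import annotations

import re

# Section titles copied from the template in this section.
ALLOWED_SECTIONS = [
    "Executive Summary",
    "Key Findings",
    "Context",
    "Evidence Review",
    "Detailed Analysis",
    "Implications",
    "Source Notes",
    "References",
]


def check_report_sections(report: str) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) level-2 headings found in a Markdown report."""
    # Match lines such as "## Key Findings"; "#" and "###" headings are ignored.
    matches = re.findall(r"^##[ \t]+(.+?)[ \t]*$", report, flags=re.MULTILINE)
    found = [title.strip() for title in matches]
    missing = [name for name in ALLOWED_SECTIONS if name not in found]
    unexpected = [name for name in found if name not in ALLOWED_SECTIONS]
    return missing, unexpected


if __name__ == "__main__":
    draft = "# Research Report\n\n## Executive Summary\n...\n\n## Bonus Section\n...\n"
    print(check_report_sections(draft))
```

If `unexpected` is non-empty, one option is to reject the draft and ask the Analyst to regenerate it within the allowed sections.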

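## 7. Give the Manager an explicit six-step decision rule

The Manager's instructions should read like a flowchart rather than a vague "please research this for me". Recommended rules: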

```text
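You are the research workflow orchestrator.

You must follow these steps:
1. First call the quick-answer tool to get an initial answer or an initial search direction.
2. Call judge_answer_quality to evaluate the evidence quality.
3. If score >= 0.85 and is_good_enough=true, call the analyst to generate the final report, then stop.
4. If the threshold is not met, call search_with_scrape and prioritize the gaps listed in missing_information.
5. If the search results contain a high-value URL with insufficient content, call scrape_url to fetch that URL directly.
6. Stop when max_rounds, max_sources, or max_cost is reached, or on consecutive empty results. Do not loop forever.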
```

These rules are more reliable than "let the Agent figure it out".
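
The same rules can also be mirrored in plain Python, which makes the routing testable without calling a model. This is a sketch under assumptions: `LoopState` and `next_action` are hypothetical names, the returned strings reuse the tool names from the prompt, and checking the stop conditions before further searching is this sketch's ordering choice.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    rounds_used: int = 0
    empty_rounds_in_a_row: int = 0
    # True when a promising URL came back with too little text (rule 5).
    has_thin_high_value_url: bool = False


def next_action(
    score: float,
    is_good_enough: bool,
    state: LoopState,
    max_rounds: int = 3,
    max_empty_rounds: int = 2,
    threshold: float = 0.85,
) -> str:
    """Pick the next step after the Judge has scored the current evidence."""
    # Rule 3: evidence passes, so hand over to the Analyst.
    if is_good_enough and score >= threshold:
        return "write_report"
    # Rule 6: out of rounds or stuck on empty results, so stop.
    if state.rounds_used >= max_rounds or state.empty_rounds_in_a_row >= max_empty_rounds:
        return "stop"
    # Rule 5: re-fetch a promising URL before running a new search.
    if state.has_thin_high_value_url:
        return "scrape_url"
    # Rule 4: otherwise search again, guided by missing_information.
    return "search_with_scrape"


if __name__ == "__main__":
    print(next_action(0.6, False, LoopState(rounds_used=1)))
```

Keeping this logic outside the prompt gives a deterministic backstop if the Manager model ignores rule 6.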

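## 8. Production safeguards you must add

The original workflow works for teaching, but a production version needs at least 6 safeguards:

### 8.1 Empty-result guard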

```python
def has_enough_text(evidence: list[Evidence], min_chars: int = 500) -> bool:
    return any(len(item.content.strip()) >= min_chars for item in evidence)
```

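If two consecutive rounds yield no sufficient body text, stop and return "could not extract enough evidence" instead of letting the Agent keep searching.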
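
One way to implement the "two consecutive rounds" rule is a small counter that resets whenever a round returns usable text. In this sketch, `EmptyRoundTracker` is a hypothetical helper, and the trimmed-down `Evidence` class exists only to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Evidence:
    title: str
    url: str
    content: str


def has_enough_text(evidence: list[Evidence], min_chars: int = 500) -> bool:
    return any(len(item.content.strip()) >= min_chars for item in evidence)


class EmptyRoundTracker:
    """Counts consecutive rounds that produced no usable body text."""

    def __init__(self, max_empty_rounds: int = 2) -> None:
        self.max_empty_rounds = max_empty_rounds
        self.streak = 0

    def record(self, round_evidence: list[Evidence]) -> bool:
        """Record one round; return True when the loop should stop."""
        if has_enough_text(round_evidence):
            self.streak = 0  # a useful round resets the streak
        else:
            self.streak += 1
        return self.streak >= self.max_empty_rounds


if __name__ == "__main__":
    tracker = EmptyRoundTracker()
    print(tracker.record([]))  # first empty round
    print(tracker.record([]))  # second empty round in a row
```

Pass only the evidence from the current round to `record`, not the accumulated list, or an early good round would mask later empty ones.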

### 8.2 Cost budget

```python
@dataclass
class ResearchBudget:
    max_rounds: int = 3
    max_sources: int = 8
    max_scrape_calls: int = 5
```

The Manager checks the budget before every search or scrape call.
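
A possible shape for that check is shown below. `BudgetUsage` and `exhausted_limit` are illustrative names introduced for this sketch; the limits match the `ResearchBudget` fields above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResearchBudget:
    max_rounds: int = 3
    max_sources: int = 8
    max_scrape_calls: int = 5


@dataclass
class BudgetUsage:
    rounds: int = 0
    sources: int = 0
    scrape_calls: int = 0


def exhausted_limit(budget: ResearchBudget, usage: BudgetUsage) -> str | None:
    """Return the name of the first limit that is used up, or None if there is room left."""
    checks = [
        ("max_rounds", usage.rounds, budget.max_rounds),
        ("max_sources", usage.sources, budget.max_sources),
        ("max_scrape_calls", usage.scrape_calls, budget.max_scrape_calls),
    ]
    for name, used, limit in checks:
        if used >= limit:
            return name
    return None


if __name__ == "__main__":
    usage = BudgetUsage(rounds=1, sources=8, scrape_calls=2)
    print(exhausted_limit(ResearchBudget(), usage))
```

Returning the name of the exhausted limit, rather than a bare boolean, lets the Manager report why it stopped.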

### 8.3 Deduplication

```python
def dedupe_by_url(items: list[Evidence]) -> list[Evidence]:
    seen: set[str] = set()
    result: list[Evidence] = []
    for item in items:
        if item.url in seen:
            continue
        seen.add(item.url)
        result.append(item)
    return result
```

### 8.4 Source transparency

The final report must keep the source URLs. A conclusion without a source may only be labeled as an inference.
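
This rule can be enforced at render time instead of being left to the model. The sketch below assumes findings arrive as structured objects; `Finding` and `render_findings` are hypothetical names, and the `[inference]` label is one possible marker.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Finding:
    text: str
    source_url: str | None = None


def render_findings(findings: list[Finding], known_urls: set[str]) -> str:
    """Render bullets; anything without a collected source is labeled as inference."""
    lines: list[str] = []
    for finding in findings:
        # A URL only counts if it is one the tools actually collected.
        if finding.source_url and finding.source_url in known_urls:
            lines.append(f"- {finding.text} (source: {finding.source_url})")
        else:
            lines.append(f"- [inference] {finding.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    urls = {"https://example.com/agent-evaluation"}
    items = [
        Finding("A judge can score evidence sufficiency.", "https://example.com/agent-evaluation"),
        Finding("Most teams will need a budget cap."),
    ]
    print(render_findings(items, urls))
```

Checking against `known_urls` also catches a model that cites a URL the tools never returned.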

### 8.5 Trace auditing

Save the following for every run:

```text
- User question
- Search queries
- URLs used
- Judge score
- missing_information
- Final report path
```
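
One lightweight option is to append each run as a line of JSON (JSONL), which is easy to grep and to load later. In this sketch, `append_trace`, the field names, and the added `timestamp` field are assumptions; only the list of items to record comes from the text above.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def append_trace(
    path: str | Path,
    *,
    question: str,
    queries: list[str],
    urls: list[str],
    score: float,
    missing_information: list[str],
    report_path: str | None = None,
) -> dict:
    """Append one run record to a JSONL file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "queries": queries,
        "urls": urls,
        "judge_score": score,
        "missing_information": missing_information,
        "report_path": report_path,  # None when no report was generated
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-English questions readable in the log.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A typical call would sit at the end of the Manager's `run`, for example `append_trace("traces/runs.jsonl", question=question, queries=queries, urls=urls, score=0.7, missing_information=gaps)`, where the file path is arbitrary.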

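### 8.6 Fallback strategy

If the search API is unavailable:

- Return a summary of the evidence already collected; or
- Ask the user to provide links or text; or
- Report explicitly that "live retrieval could not be completed".

Do not pretend the research was completed.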
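
The three options can be expressed as an ordered fallback. This sketch is only one way to prioritize them; `fallback_response`, the status strings, and the message wording are hypothetical.

```python
from __future__ import annotations


def fallback_response(existing_evidence: list[str], can_ask_user: bool) -> dict[str, object]:
    """Choose a degraded response when live search is unavailable."""
    # Option 1: reuse whatever evidence was collected before the failure.
    if existing_evidence:
        return {
            "status": "partial",
            "message": "Live search was unavailable; summarizing evidence collected earlier.",
            "evidence": existing_evidence,
        }
    # Option 2: ask the user to supply material directly.
    if can_ask_user:
        return {
            "status": "needs_input",
            "message": "Live search was unavailable; please provide links or text.",
            "evidence": [],
        }
    # Option 3: state plainly that retrieval did not happen.
    return {
        "status": "failed",
        "message": "Live retrieval could not be completed.",
        "evidence": [],
    }


if __name__ == "__main__":
    print(fallback_response([], can_ask_user=True)["status"])
```

Every branch states that live search failed, so the caller can never mistake a degraded answer for a completed study.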

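## 9. Acceptance checklist

When you are done, verify against this list:

- The Manager / Judge / Analyst loop runs offline end to end.
- The Judge outputs structured fields rather than a paragraph of natural language.
- When `score < 0.85`, no final report is generated, or evidence gathering continues.
- Consecutive empty searches or scrapes stop the loop instead of running forever.
- The final report includes source URLs.
- The final report distinguishes facts, evidence, and inferences.
- API keys live only in `.env` and are never written into code.
- Run logs make it possible to trace each round's search and Judge score.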

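## 10. Troubleshooting common problems

### Problem 1: plenty of search results, but the report is still poor

Likely cause: the Judge looks only at quantity, not quality. Fix: when scoring, check body length, source type, publication date, and whether primary sources are included.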
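
A per-source quality score along those lines might look like the sketch below. The weights (0.4 / 0.3 / 0.3), the 500-character and 730-day thresholds, the `"primary"` label, and the names `SourceMeta` and `quality_score` are all illustrative assumptions that would need tuning on real data.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class SourceMeta:
    url: str
    content: str
    source_type: str = "web"        # assumed labels: "primary", "news", "web"
    published: date | None = None   # None when the date is unknown


def quality_score(
    item: SourceMeta,
    today: date,
    min_chars: int = 500,
    max_age_days: int = 730,
) -> float:
    """Score one source on body length, source type, and freshness."""
    score = 0.0
    if len(item.content.strip()) >= min_chars:
        score += 0.4  # enough body text to support a claim
    if item.source_type == "primary":
        score += 0.3  # first-hand source
    if item.published is not None:
        age_days = (today - item.published).days
        if 0 <= age_days <= max_age_days:
            score += 0.3  # recent enough; unknown or future dates earn nothing
    return round(score, 2)


if __name__ == "__main__":
    src = SourceMeta("https://example.com/a", "x" * 800, "primary", date(2024, 1, 1))
    print(quality_score(src, today=date(2024, 6, 1)))
```

The Judge can then average these per-source scores instead of counting URLs.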

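### Problem 2: the Agent keeps searching and costs spiral

Likely cause: there is no budget cap. Fix: add `max_rounds`, `max_sources`, and `max_scrape_calls`, and write "stop on consecutive empty results" into the Manager's rules.

### Problem 3: the report cites content that is not in the sources

Likely cause: the Analyst wrote inferences as facts. Fix: require every key finding to be bound to a URL; content without a source must be labeled as an "inference".

### Problem 4: web scraping returns empty Markdown

Likely cause: anti-scraping measures, a login wall, or a dynamically rendered page. Fix: try a different URL, ask the user to provide the body text, or use a browser or a dedicated scraping service. Do not generate a long report from titles and snippets alone.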

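## 11. Summary of the minimal reproducible version

If you just want to reproduce the article's approach quickly, follow this order:

1. Run the offline `app.py` from this tutorial first to validate the loop.
2. Replace `MockSearchTool` with `OlostepSearchTool`.
3. Replace `JudgeAgent` with a real LLM Judge, but keep the `Judgment` structure.
4. Replace `AnalystAgent` with a real LLM Analyst, but keep the report sections fixed.
5. Give the Manager a budget, an empty-result stop, deduplication, and source tracking.
6. Save a trace of every run to make review and tuning easier.

What really matters is not "multiple Agents" but this chain of control: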

```text
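Orchestrator handles scheduling → tools fetch data → Judge acts as the quality gate → Analyst handles the writing → budgets and logs keep production under control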
```