From Scripts to Systems: A Practical Architecture for AI Agents

Introduction: Why Move Beyond One-Off Prompts

One-off prompt scripts rarely survive complex, long-running, multi-goal tasks. To make agents that actually work, we must build systems: modular, observable, auditable, and governed. This essay offers a practical architecture for such agents, grounded in consensus research up to 2024. Any 2025 developments are noted as requiring further verification.

“Don’t just make models that answer; make systems that work — goals, plans, tools, memory, and evaluation in a tight loop.”

We’ll walk through an end-to-end pipeline — perception, memory, planning, tools, execution, and evaluation — and provide engineering examples, risk notes, a practice checklist, anti-patterns, cost/SLO guidance, and a concrete case.

From an engineering lens, a system-level agent is a controlled pipeline: external signals enter, information is structured, state accumulates, tasks are decomposed and orchestrated, tools are called, and results are audited and fed back.

To avoid hand-wavy abstraction, each module includes concrete scenarios (enterprise search, weekly report automation, clinical documentation assistance) and risk/governance notes. Key claims cite peer-reviewed or top-venue sources so you can verify and extend.

Perception and Context Construction

Perception converts external signals and history into usable context (text, structured data, multimodal) and uses retrieval to keep generation grounded.

Retrieval-Augmented Generation (RAG) has been shown to improve factual accuracy and controllability on knowledge-intensive tasks (Lewis et al., 2020). Multimodal fusion is also widely reported, in surveys at venues such as NeurIPS and ICLR and in large-scale deployments, to improve robustness on complex tasks.

Engineering trade-offs matter: more raw input isn’t always better. Prune, segment, and structure to reduce cost and noise. RAG quality depends on index construction, update cadence, and data governance.

This layer seeds the memory and planner with stable material and a shared baseline of state.

Engineering Example

Enterprise Q&A: Build a vector index over internal docs (PDFs, Confluence, code comments). Use paragraph-level chunking plus metadata filters; inject only the top 3–5 passages during answer generation to reduce hallucinations and cost.

Risk and Governance

Data governance: Define retrievable domains and confidentiality levels. Enforce hard filters against “unauthorized” content.
Index hygiene: Schedule rebuilds and incremental updates to avoid outdated knowledge.
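
To make the retrieval and governance points above concrete, here is a minimal sketch of filter-then-rank retrieval. It assumes a toy bag-of-words similarity in place of a real embedding model, and the `Chunk` type, the confidentiality labels, and the sample corpus are all hypothetical. The point it illustrates is ordering: the access filter runs before ranking, so unauthorized content can never reach the prompt.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    confidentiality: str  # hypothetical labels: "public", "internal", "restricted"


def _vectorize(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list, allowed_levels: set, top_k: int = 3) -> list:
    """Apply the hard access filter first, then rank what remains by similarity."""
    permitted = [c for c in chunks if c.confidentiality in allowed_levels]
    q = _vectorize(query)
    scored = [(_cosine(q, _vectorize(c.text)), c) for c in permitted]
    scored = [(s, c) for s, c in scored if s > 0]  # drop chunks with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


corpus = [
    Chunk("Deploy the billing service with the release checklist.", "confluence/deploy", "internal"),
    Chunk("Salary bands for the billing team.", "hr/compensation.pdf", "restricted"),
    Chunk("The billing service exposes a REST API for invoices.", "repo/billing/README", "internal"),
    Chunk("Office opening hours and visitor policy.", "handbook.pdf", "public"),
]

hits = retrieve("how do I deploy the billing service", corpus, allowed_levels={"public", "internal"})
for hit in hits:
    print(hit.source)
```

Only the returned passages would be injected into the prompt. Filtering after ranking would also work functionally, but filtering first keeps restricted text out of every downstream log and score.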

Memory Systems: Short-Term, Long-Term, and Working Memory

Layered memory provides state continuity and traceability. The central questions are: when to store, when to forget, and how to find.

MemGPT proposes hierarchical memory and paging for long-lived interactions (Packer et al., 2023). Transformer-XL extends the usable context through segment-level recurrence (Dai et al., 2019); this is a model-internal mechanism, distinct from external memory but complementary to it.

Write strategy: capture “high-value events” and “state transitions” to control cost. Eviction strategy: time decay, access frequency, or task-phase heuristics. Retrieval strategy: vector search with metadata filters and semantic re-ranking to avoid noise.

This layer feeds planning and orchestration, preventing isolated actions and context drift.

Engineering Example

R&D assistant: Store “last 10 dialogue summaries”, “this week’s key events”, and “task queue status” in working memory; keep project docs and minutes in a long-term vector store. Query working memory first; fall back to long-term.
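
The "working memory first, long-term second" lookup can be sketched as follows. This is an illustration under stated assumptions: a plain dictionary stands in for the long-term vector store, least-recently-used eviction stands in for the time-decay or access-frequency heuristics mentioned earlier, and the class names are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class WorkingMemory:
    """Small bounded store that evicts the least recently used entry when full."""

    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self._items = OrderedDict()

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)  # mark as most recently used
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]


class LayeredMemory:
    """Query working memory first, then fall back to the long-term store."""

    def __init__(self, working: WorkingMemory, long_term: dict):
        self.working = working
        self.long_term = long_term  # stand-in for a vector store

    def recall(self, key: str) -> tuple:
        value = self.working.get(key)
        if value is not None:
            return value, "working"
        value = self.long_term.get(key)
        if value is not None:
            self.working.put(key, value)  # promote so the next lookup is cheap
            return value, "long_term"
        return None, "miss"


memory = LayeredMemory(
    WorkingMemory(capacity=2),
    long_term={"milestone:q1": "Shipped billing API v1", "minutes:kickoff": "Agreed weekly cadence"},
)
memory.working.put("task_queue", "draft report; review PRs")

print(memory.recall("task_queue"))    # served from working memory
print(memory.recall("milestone:q1"))  # falls back to long-term, then promoted
print(memory.recall("milestone:q1"))  # now served from working memory
print(memory.recall("unknown"))       # found in neither tier
```

Returning the tier alongside the value is a deliberate choice here: it makes cache behavior visible in logs, which helps when diagnosing context drift.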

Implementation Notes

Logs and snapshots: Record structured event logs for state changes; support replay for debugging and audits.
Memory compression: Use “topic summaries + key quotes” to reduce context length; retrieve originals when needed.
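
As a rough illustration of the logs-and-snapshots note, the sketch below records state changes as structured events and rebuilds state by replaying them. The event kinds (`task_added`, `task_done`) and the state shape are hypothetical; the idea is only that state is derived from the log, so any past moment can be reconstructed for debugging or audit.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    seq: int       # monotonically increasing sequence number
    kind: str      # hypothetical kinds: "task_added", "task_done"
    payload: dict


def apply(state: dict, event: Event) -> dict:
    """Pure transition: returns a new state and leaves the old one untouched."""
    tasks = dict(state["tasks"])
    if event.kind == "task_added":
        tasks[event.payload["id"]] = "pending"
    elif event.kind == "task_done":
        tasks[event.payload["id"]] = "done"
    return {"tasks": tasks, "last_seq": event.seq}


def replay(events: list, upto: int = None) -> dict:
    """Rebuild state from the log, optionally stopping at sequence number `upto`."""
    state = {"tasks": {}, "last_seq": 0}
    for event in sorted(events, key=lambda e: e.seq):
        if upto is not None and event.seq > upto:
            break
        state = apply(state, event)
    return state


log = [
    Event(1, "task_added", {"id": "draft"}),
    Event(2, "task_added", {"id": "publish"}),
    Event(3, "task_done", {"id": "draft"}),
]

# Round-trip through JSON lines, as an append-only log file would store them.
lines = [json.dumps(asdict(e)) for e in log]
restored = [Event(**json.loads(line)) for line in lines]

print(replay(restored))          # current state
print(replay(restored, upto=2))  # state as it was after event 2
```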

Planning and Orchestration

Planning breaks goals into executable steps (decompose–order–manage dependencies) and defines human-in-the-loop and rollback paths.

Chain-of-Thought improves complex reasoning (Wei et al., 2022); Self-Consistency increases robustness via multi-path sampling and voting (Wang et al., 2022); ReAct couples reasoning and acting for tool use and environment interaction (Yao et al., 2022).
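
The voting step behind Self-Consistency can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `sample` is a hypothetical callable standing in for one sampled model completion reduced to its final answer, and the canned answers below replace real model calls.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> tuple:
    """Sample n final answers and return the majority answer with its vote share."""
    answers = [sample().strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Stand-in for five sampled reasoning paths that end in these final answers.
canned = iter(["42", "42", "41", "42 ", "40"])
answer, agreement = self_consistent_answer(lambda: next(canned), n=5)
print(answer, agreement)
```

The vote share is worth keeping: a planner can treat low agreement as a signal to escalate to human review instead of acting on the answer.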

Planners must define success metrics (thresholds and indicators), exception handling (retries and bypasses), and human confirmation interfaces.

Outputs flow into tools and executors to form observable, rollback-friendly workflows.

Engineering Example

Weekly report generation: “collect events → extract highlights → produce a structured draft → request human confirmation → publish to knowledge base” as five orchestrated steps with confirmation and rollback to keep critical outputs controlled.

Design Notes

Success measures: Validate each step’s output (JSON schema checks, keyword hits, link integrity).
Exception paths: Set retries and bypass strategies; degrade external APIs to cached or alternate sources when needed.
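
The design notes above can be combined into a small orchestrator sketch. The assumptions are these: each step carries its own validator, an optional fallback source, and a flag that gates it behind human confirmation; all names (`Step`, `run_pipeline`, `ConfirmationRejected`) are hypothetical, and `confirm` stands in for a real review interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


class StepFailed(Exception):
    pass


class ConfirmationRejected(Exception):
    pass


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]
    validate: Callable[[Any], bool]
    fallback: Optional[Callable[[Any], Any]] = None  # e.g. cached or alternate source
    max_retries: int = 2
    requires_confirmation: bool = False


def run_step(step: Step, data: Any) -> Any:
    """Try the step, retry on failure, then degrade to the fallback if one exists."""
    for _ in range(1 + step.max_retries):
        try:
            result = step.run(data)
            if step.validate(result):
                return result
        except Exception:
            pass  # sketch only: treat any error like a failed validation and retry
    if step.fallback is not None:
        result = step.fallback(data)
        if step.validate(result):
            return result
    raise StepFailed(step.name)


def run_pipeline(steps: list, data: Any, confirm: Callable[[Any], bool]) -> Any:
    for step in steps:
        # Gate critical steps: nothing downstream runs without approval.
        if step.requires_confirmation and not confirm(data):
            raise ConfirmationRejected(step.name)
        data = run_step(step, data)
    return data


calls = {"n": 0}


def flaky_collect(_):
    calls["n"] += 1
    raise TimeoutError("events API unavailable")


steps = [
    Step("collect", flaky_collect, validate=lambda r: len(r) > 0,
         fallback=lambda _: ["cached: release 1.2 shipped"]),
    Step("draft", lambda events: {"highlights": events}, validate=lambda d: "highlights" in d),
    Step("publish", lambda d: {**d, "published": True},
         validate=lambda d: d.get("published") is True, requires_confirmation=True),
]

result = run_pipeline(steps, None, confirm=lambda draft: True)
print(result)
```

Because the confirmation check runs before the publish step, a rejection leaves the knowledge base untouched, which is the simplest form of rollback.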

Tools and Executors

Tools include search, code execution, databases, and APIs. Executors encapsulate call protocols, sandboxes, and rate limits.

Toolformer suggests models can learn when and how to call tools via self-supervision (Schick et al., 2023). Gorilla demonstrates robust connections to large API ecosystems (Patil et al., 2023).

Permissions and rate: enforce least privilege, tiered tokens, and rate limits to prevent abuse and exhaustion. Auditability: log parameters, results, and side effects for diagnostics and compliance. Sandboxing: isolate code execution and external systems to reduce unpredictable risks.

Results feed the evaluation layer and drive the loop forward.

Engineering Example

Reporting automation: Use read-only credentials to access the data warehouse. Executors whitelist SQL and bind parameters; queue calls beyond rate limits to protect production systems.

Risk and Governance

Secrets management: tier and scope credentials by environment; prevent “dev keys” from touching production.
Output auditing: record “who called which tool when, with what output”, and produce traceable audit reports.
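
A governed executor along these lines might look like the sketch below. It is a minimal illustration with hypothetical names throughout: a per-caller allowlist enforces least privilege, a sliding window enforces the rate limit, and every call, including denied ones, lands in the audit log. For brevity it rejects calls over the limit where a production executor might queue them, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable


class ToolExecutor:
    def __init__(self, tools: dict, permissions: dict, max_calls: int,
                 per_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.tools = tools              # tool name -> callable
        self.permissions = permissions  # caller -> set of allowed tool names
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.clock = clock
        self._recent = deque()          # timestamps of calls inside the window
        self.audit_log = []

    def call(self, caller: str, tool: str, **params) -> dict:
        now = self.clock()
        entry = {"caller": caller, "tool": tool, "params": params, "time": now}
        if tool not in self.permissions.get(caller, set()):
            return self._finish(entry, "denied", None)
        # Drop timestamps that have left the sliding window.
        while self._recent and now - self._recent[0] >= self.per_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_calls:
            return self._finish(entry, "rate_limited", None)
        self._recent.append(now)
        try:
            output = self.tools[tool](**params)
        except Exception as exc:
            return self._finish(entry, "error", repr(exc))
        return self._finish(entry, "ok", output)

    def _finish(self, entry: dict, status: str, output) -> dict:
        """Record who called which tool when, with what result."""
        entry.update(status=status, output=output)
        self.audit_log.append(entry)
        return entry


fake_now = [0.0]  # controllable clock for the demonstration
executor = ToolExecutor(
    tools={"row_count": lambda table: {"table": table, "rows": 128}},
    permissions={"report-agent": {"row_count"}},
    max_calls=2,
    per_seconds=60.0,
    clock=lambda: fake_now[0],
)

statuses = [executor.call("report-agent", "row_count", table="events")["status"] for _ in range(3)]
denied = executor.call("intern-bot", "row_count", table="events")["status"]
fake_now[0] = 61.0  # the window has passed, so calls are accepted again
later = executor.call("report-agent", "row_count", table="events")["status"]
print(statuses, denied, later, len(executor.audit_log))
```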

Evaluation and Feedback Loop

Use goal-oriented metrics — correctness, efficiency, cost, and explainability — to drive continuous improvement and align with privacy and governance.

TruthfulQA shows that models can reproduce plausible-sounding falsehoods, which underscores the need for factual evaluation (Lin et al., 2021). Clinical contexts demand particular caution, as viewpoints published in JAMA and NEJM AI emphasize: risk, ethics, and human oversight are not optional.

Design metrics across four classes: task completion rate, factual correctness, side-effect cost, and latency/throughput. Build loops with self-reflection, external review, and A/B testing. Logs, versioning, and permissions enable audit and accountability.

Without evaluation, you don’t have a system — you have a one-off script.

Example Metrics

Correctness: citation hit rate; factual checks against labeled sets or external validators.
Efficiency: completion time, average steps, average tool call duration.
Cost: tokens per task, API fees, retry overhead.
Explainability: reproducibility, audit-log completeness, time-to-diagnose.
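
Several of these metrics can be computed directly from per-task records. The sketch below assumes a hypothetical `TaskRecord` shape and made-up sample values; it is meant to show how little machinery a first metrics pass needs, not to prescribe a schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    completed: bool
    citations_total: int     # citations the agent produced
    citations_verified: int  # citations that resolved to a real supporting passage
    seconds: float
    steps: int
    tokens: int
    retries: int


def summarize(records: list) -> dict:
    """Aggregate correctness, efficiency, and cost metrics over a batch of tasks."""
    if not records:
        raise ValueError("no records to summarize")
    total_citations = sum(r.citations_total for r in records)
    verified = sum(r.citations_verified for r in records)
    return {
        "completion_rate": sum(r.completed for r in records) / len(records),
        "citation_hit_rate": verified / total_citations if total_citations else 0.0,
        "avg_seconds": mean(r.seconds for r in records),
        "avg_steps": mean(r.steps for r in records),
        "tokens_per_task": mean(r.tokens for r in records),
        "retries_per_task": sum(r.retries for r in records) / len(records),
    }


records = [
    TaskRecord(True, 4, 4, 30.0, 5, 2000, 0),
    TaskRecord(True, 6, 3, 50.0, 7, 3000, 1),
    TaskRecord(False, 0, 0, 10.0, 2, 1000, 2),
    TaskRecord(True, 2, 2, 30.0, 6, 2000, 1),
]

summary = summarize(records)
for name, value in summary.items():
    print(f"{name}: {value}")
```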

Loop Mechanics

Self-reflection: insert “self-check” nodes to validate logic and evidence at key steps.
External review: sample human evaluations and A/B tests to avoid drift.

Architecture at a Glance

The diagram below sketches the main path and feedback loop from external signals to evaluation.

```mermaid
flowchart LR
    A[External signals/data] --> B[Perception & parsing]
    B --> C[RAG retrieval]
    C --> D[Context assembly]
    D --> E[Planner]
    E --> F[Tool executors]
    F --> G[Evaluation & feedback]
    G --> C
```

Practice Checklist

Define data governance boundaries: accessible domains, confidentiality levels, and index update cadence.
Design layered memory and event logs: enable state replay and error localization.
Require verifiable outputs: JSON schemas, link checks, keyword hits.
Enforce permissions and rate control: least privilege, token tiers, throttling and queues.
Establish evaluation cadence: weekly reviews, A/B tests, error postmortems, and improvement plans.

Anti-Patterns and Risk

Monolithic prompts: dump everything into context, causing high cost and hallucinations.
Ungoverned tool use: no whitelists or audits; side effects and risk sources are opaque.
No evaluation loop: no metrics or sampled reviews; the system degrades silently.
Over-automation: skip human confirmation in high-risk domains (healthcare, finance).

Cost, SLOs, and Scaling

Cost model: total ≈ tokens * unit price + external API fees + infra. Track retries and fallbacks.
SLOs: set targets for accuracy ≥ X, latency ≤ Y, cost ≤ Z per scenario.
Scaling: start with a small closed loop; once metrics stabilize, expand data domains and tool scope.
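
The cost model and SLO targets above translate into a short worked example. All numbers here are invented for illustration (the token price, fees, and thresholds are not real quotes), and the field names are hypothetical; tokens spent on retries are assumed to be included in the token count.

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    tokens: int               # total tokens, including those spent on retries
    external_api_fees: float
    infra_cost: float


def task_cost(usage: TaskUsage, price_per_1k_tokens: float) -> float:
    """total ≈ tokens * unit price + external API fees + infra."""
    return usage.tokens / 1000 * price_per_1k_tokens + usage.external_api_fees + usage.infra_cost


def check_slo(observed: dict, slo: dict) -> dict:
    """Return a pass/fail flag per target: accuracy >= X, latency <= Y, cost <= Z."""
    return {
        "accuracy": observed["accuracy"] >= slo["min_accuracy"],
        "latency": observed["latency_s"] <= slo["max_latency_s"],
        "cost": observed["cost_per_task"] <= slo["max_cost_per_task"],
    }


usage = TaskUsage(tokens=12_000, external_api_fees=0.01, infra_cost=0.005)
cost = task_cost(usage, price_per_1k_tokens=0.002)  # illustrative price only

observed = {"accuracy": 0.93, "latency_s": 41.0, "cost_per_task": cost}
slo = {"min_accuracy": 0.90, "max_latency_s": 30.0, "max_cost_per_task": 0.05}

result = check_slo(observed, slo)
print(round(cost, 4), result)
```

In this made-up scenario accuracy and cost meet their targets while latency does not, which is the kind of per-scenario signal that should block expansion of data domains or tool scope.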

Case Study: Weekly Report Assistant

Scenario

Generate a weekly R&D report from event logs and commit messages; request lead confirmation before publishing.

Flow

Perception: collect weekly events, PR merges, meeting summaries.
Memory: write into working memory; vectorize historical milestones for long-term storage.
Planning: “extract highlights → produce a structured draft → validate JSON → request human confirmation”.
Tools: call report templates and knowledge-base APIs; validate parameters and enforce rate limits.
Evaluation: JSON schema checks, link integrity, keyword hits; track accuracy and time-to-complete.
Feedback: review failures; update RAG re-ranking and summarization strategies.
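
The evaluation step in this flow can be sketched with the standard library alone. The assumptions: the report schema (`week`, `highlights`, `links`) is hypothetical, the field checks are a hand-rolled stand-in for a full JSON Schema validator, and "link integrity" is reduced to a syntactic URL check, since a real check would also fetch each link.

```python
import json
from urllib.parse import urlparse

# Hypothetical report schema: field name -> required Python type.
REQUIRED_FIELDS = {"week": str, "highlights": list, "links": list}


def validate_report(raw: str, required_keywords: list) -> list:
    """Return a list of problems; an empty list means the draft passed."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["top-level value must be an object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(report.get(field), expected):
            problems.append(f"field '{field}' missing or not {expected.__name__}")

    links = report.get("links") if isinstance(report.get("links"), list) else []
    for link in links:
        parsed = urlparse(str(link))
        if parsed.scheme not in {"http", "https"} or not parsed.netloc:
            problems.append(f"malformed link: {link}")

    highlights = report.get("highlights") if isinstance(report.get("highlights"), list) else []
    text = " ".join(str(h) for h in highlights).lower()
    for keyword in required_keywords:
        if keyword.lower() not in text:
            problems.append(f"missing keyword: {keyword}")
    return problems


good = json.dumps({
    "week": "2024-W20",
    "highlights": ["Merged PR for billing API", "Release 1.2 shipped"],
    "links": ["https://example.com/pr/1"],
})
bad = json.dumps({"week": "2024-W20", "highlights": ["Fixed bugs"], "links": ["example.com/pr"]})

print(validate_report(good, ["release"]))
print(validate_report(bad, ["release"]))
```

A draft would only be sent for human confirmation once this returns an empty list, which keeps reviewers from spending time on structurally broken output.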

Outcome and Iteration

Expected outcomes: lower cost and better traceability than ad hoc scripts. The human confirmation gate on critical output prevents mispublishing, and each failure feeds improvements to retrieval and summarization.

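Challenges and Ethical Considerations

Risks: permission abuse, unauthorized calls, data leakage, unexplainable decisions, and hidden side effects.
Governance: least privilege, audit logs, explainability reports, human oversight, and rollback mechanisms.
Compliance: follow regional regulations (privacy, copyright, and sector-specific rules such as healthcare and finance).

Conclusion: Engineering Control as the Price of Sustainable Agents

Key takeaways: layered memory, explicit planning, strict tool governance, and an auditable evaluation loop.
Practical advice: pilot with a small closed loop in a single scenario, and expand only after metrics and auditing are in place; treat data quality and compliance as standing prerequisites.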

References and Further Reading

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171
Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761
Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334
Packer, C. et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560
Dai, Z. et al. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv:1901.02860
Lin, S. et al. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958

Suggested further reading (draft): technical blogs from DeepMind/Google Research and OpenAI/Anthropic; Stanford HAI and MIT Technology Review; academic journals and venues (Nature, Science, NeurIPS, ICLR).