From Scripts to Systems: A Practical Architecture for AI Agents

Devin

Introduction: Why Move Beyond One-Off Prompts

One-off prompt scripts rarely survive complex, long-running, multi-goal tasks. To make agents that actually work, we must build systems: modular, observable, auditable, and governed. This essay offers a practical architecture for such agents, grounded in research published through 2024; any 2025 developments are flagged as needing further verification.

“Don’t just make models that answer; make systems that work — goals, plans, tools, memory, and evaluation in a tight loop.”

We’ll walk through an end-to-end pipeline — perception, memory, planning, tools, execution, and evaluation — and provide engineering examples, risk notes, a practice checklist, anti-patterns, cost/SLO guidance, and a concrete case.

From an engineering lens, a system-level agent is a controlled pipeline: external signals enter, information is structured, state accumulates, tasks are decomposed and orchestrated, tools are called, and results are audited and fed back.

To avoid hand-wavy abstraction, each module includes concrete scenarios (enterprise search, weekly report automation, clinical documentation assistance) and risk/governance notes. Key claims cite peer-reviewed or top-venue sources so you can verify and extend.

Perception and Context Construction

Perception converts external signals and history into usable context (text, structured data, multimodal) and uses retrieval to keep generation grounded.

Retrieval-Augmented Generation (RAG) has been shown to improve factual accuracy and controllability on knowledge-intensive tasks (Lewis et al., 2020). Multimodal fusion is also reported to improve robustness on complex tasks, both in NeurIPS/ICLR surveys and in large-scale deployments.

Engineering trade-offs matter: more raw input isn’t always better. Prune, segment, and structure to reduce cost and noise. RAG quality depends on index construction, update cadence, and data governance.

This layer seeds the memory and planner with stable material and a shared baseline of state.

Engineering Example
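As one possible illustration of the "prune, segment, and structure" advice above, the sketch below splits raw text into small chunks, drops exact duplicates, and assembles a context from the best-matching chunks. It is a minimal sketch under stated assumptions: the names `Chunk`, `segment`, `retrieve`, and `build_context` are hypothetical, and word overlap stands in for the vector search a real RAG index would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str  # hypothetical document identifier
    index: int   # position of the chunk within its source
    text: str


def segment(source: str, raw: str, max_words: int = 40) -> list[Chunk]:
    """Split raw text on blank lines, cap chunk size, and prune duplicates."""
    chunks: list[Chunk] = []
    seen: set[str] = set()
    for para in re.split(r"\n\s*\n", raw):
        words = para.split()
        for start in range(0, len(words), max_words):
            text = " ".join(words[start:start + max_words])
            key = text.lower()
            if key in seen:  # prune exact duplicates to cut cost and noise
                continue
            seen.add(key)
            chunks.append(Chunk(source, len(chunks), text))
    return chunks


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank chunks by word overlap with the query (assumption: a stand-in
    for embedding similarity)."""
    q = tokens(query)
    scored = [(len(q & tokens(c.text)), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda sc: (-sc[0], sc[1].index))
    return [c for _, c in scored[:k]]


def build_context(query: str, chunks: list[Chunk], k: int = 2) -> str:
    """Assemble a grounded context, keeping a source tag on every line."""
    hits = retrieve(query, chunks, k)
    return "\n".join(f"[{c.source}#{c.index}] {c.text}" for c in hits)
```

Keeping the source tag on each context line is a deliberate choice: it lets later evaluation trace an answer back to the chunk that supported it.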

Risk and Governance

Memory Systems: Short-Term, Long-Term, and Working Memory

Layered memory provides state continuity and traceability. The central questions are: when to store, when to forget, and how to find.

MemGPT proposes hierarchical memory and paging for long-lived interactions (Packer et al., 2023). Transformer-XL extends the context a model can attend to through segment-level recurrence; it is distinct from external memory but complementary (Dai et al., 2019).

Write strategy: capture “high-value events” and “state transitions” to control cost. Eviction strategy: time decay, access frequency, or task-phase heuristics. Retrieval strategy: vector search with metadata filters and semantic re-ranking to avoid noise.

This layer feeds planning and orchestration, preventing isolated actions and context drift.

Engineering Example
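The write, eviction, and retrieval strategies described in this section can be sketched in a few dozen lines. This is an illustrative sketch, not a reference design: `MemoryStore`, the `HIGH_VALUE` labels, and the scoring formula (exponential time decay multiplied by access count) are all assumptions, and keyword matching stands in for vector search with re-ranking.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryItem:
    text: str
    kind: str          # hypothetical event label, e.g. "state_transition"
    created_at: float  # seconds; passed in so the sketch stays deterministic
    hits: int = 0      # how often retrieval has returned this item


class MemoryStore:
    # Assumption: only these event kinds are worth the storage cost.
    HIGH_VALUE = {"state_transition", "decision", "error"}

    def __init__(self, capacity: int = 3, half_life_s: float = 3600.0):
        self.capacity = capacity
        self.half_life_s = half_life_s
        self.items: list[MemoryItem] = []

    def score(self, item: MemoryItem, now: float) -> float:
        """Time decay combined with access frequency."""
        age = max(0.0, now - item.created_at)
        return 0.5 ** (age / self.half_life_s) * (1 + item.hits)

    def write(self, text: str, kind: str, now: float) -> bool:
        """Write gate: store high-value events only, then evict if over capacity."""
        if kind not in self.HIGH_VALUE:
            return False
        self.items.append(MemoryItem(text, kind, now))
        if len(self.items) > self.capacity:
            victim = min(self.items, key=lambda it: self.score(it, now))
            self.items.remove(victim)
        return True

    def search(self, keyword: str, now: float,
               kind: Optional[str] = None, k: int = 2) -> list[MemoryItem]:
        """Keyword match plus an optional metadata filter, ranked by score."""
        matches = [
            it for it in self.items
            if keyword.lower() in it.text.lower()
            and (kind is None or it.kind == kind)
        ]
        matches.sort(key=lambda it: self.score(it, now), reverse=True)
        for it in matches[:k]:
            it.hits += 1  # retrieval counts as an access
        return matches[:k]
```

With a capacity of two and a half-life of 100 seconds, an older item that has been retrieved once outlives a newer item that was never read, which is the behavior the access-frequency heuristic is meant to produce.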

Implementation Notes

Planning and Orchestration

Planning breaks goals into executable steps (decompose–order–manage dependencies) and defines human-in-the-loop and rollback paths.

Chain-of-Thought improves complex reasoning (Wei et al., 2022); Self-Consistency increases robustness via multi-path sampling and voting (Wang et al., 2022); ReAct couples reasoning and acting for tool use and environment interaction (Yao et al., 2022).
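The voting step of Self-Consistency is simple enough to sketch directly. In the snippet below, `sample` is a hypothetical callable that returns the final answer of one independently sampled reasoning path; the `min_share` threshold and the choice to return `None` when no answer wins clearly are assumptions added for illustration, not part of the original method.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample: Callable[[], str],
    n: int = 5,
    min_share: float = 0.5,
) -> tuple[Optional[str], float]:
    """Sample n answers, normalize them, and return the majority answer.

    Returns (None, share) when the top answer's share is below min_share,
    so the caller can escalate instead of trusting a weak consensus.
    """
    answers = [sample().strip().lower() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    share = count / n
    return (best if share >= min_share else None), share
```

A planner could treat a `None` result as a trigger for the human confirmation path rather than proceeding on a low-agreement answer.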

Planners must define success metrics (thresholds and indicators), exception handling (retries and bypasses), and human confirmation interfaces.

Outputs flow into tools and executors to form observable, rollback-friendly workflows.

Engineering Example
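One way to make "decompose, order, manage dependencies" concrete is to model a plan as a dependency graph and run it in topological order, with bounded retries and a confirmation gate. The sketch below uses the standard-library `graphlib.TopologicalSorter`; the `Step` fields, the `execute_plan` function, and the `confirm` callback are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], Any]       # receives results of earlier steps
    depends_on: tuple[str, ...] = ()
    max_retries: int = 1             # extra attempts after the first failure
    needs_confirmation: bool = False


def execute_plan(steps: list[Step], confirm: Callable[[str], bool]) -> dict:
    """Run steps in dependency order; stop on rejection or exhausted retries."""
    by_name = {s.name: s for s in steps}
    graph = {s.name: set(s.depends_on) for s in steps}
    results: dict[str, Any] = {}
    log: list[str] = []
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        if step.needs_confirmation and not confirm(name):
            log.append(f"{name}: rejected by reviewer")
            break
        for attempt in range(1, step.max_retries + 2):
            try:
                results[name] = step.run(results)
                log.append(f"{name}: ok (attempt {attempt})")
                break
            except Exception as exc:
                log.append(f"{name}: failed (attempt {attempt}): {exc}")
        else:
            # Every attempt failed; stop so dependents never run on bad state.
            log.append(f"{name}: gave up, stopping plan")
            break
    return {"results": results, "log": log}
```

Because results are only recorded on success and the plan halts at the first unrecoverable step, the log doubles as a record of where a rollback would need to start. A real planner would also validate that every `depends_on` name exists before running.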

Design Notes

Tools and Executors

Tools include search, code execution, databases, and APIs. Executors encapsulate call protocols, sandboxes, and rate limits.

Toolformer suggests models can learn when and how to call tools via self-supervision (Schick et al., 2023). Gorilla demonstrates that a fine-tuned model can generate accurate calls across a large API ecosystem (Patil et al., 2023).

Permissions and rate limits: enforce least privilege, tiered tokens, and per-caller rate limits to prevent abuse and resource exhaustion. Auditability: log parameters, results, and side effects for diagnostics and compliance. Sandboxing: isolate code execution and access to external systems to limit unpredictable side effects.

Results feed the evaluation layer and drive the loop forward.

Engineering Example
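The permission, rate-limit, and audit requirements above can be combined in a thin wrapper around tool calls. The sketch below is one possible shape under stated assumptions: `ToolExecutor`, its `grants` allowlist, and the audit entry fields are hypothetical, and sandboxing is deliberately out of scope because it depends on the runtime.

```python
import time
from typing import Any, Callable


class ToolExecutor:
    """Wraps tool calls with an allowlist, a sliding-window rate limit,
    and an audit log. All names here are illustrative."""

    def __init__(
        self,
        tools: dict[str, Callable[..., Any]],
        grants: dict[str, set[str]],   # caller -> tools it may use
        max_calls: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.tools = tools
        self.grants = grants
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self.audit_log: list[dict[str, Any]] = []
        self._calls: dict[str, list[float]] = {}

    def call(self, caller: str, tool: str, **params: Any) -> Any:
        now = self.clock()
        entry: dict[str, Any] = {
            "caller": caller, "tool": tool, "params": params, "at": now,
        }
        self.audit_log.append(entry)  # log every attempt, including refusals

        # Least privilege: the caller must hold an explicit grant.
        if tool not in self.grants.get(caller, set()):
            entry["outcome"] = "denied"
            raise PermissionError(f"{caller} may not call {tool}")

        # Sliding-window rate limit per caller.
        recent = [t for t in self._calls.get(caller, []) if now - t < self.window_s]
        if len(recent) >= self.max_calls:
            self._calls[caller] = recent
            entry["outcome"] = "rate_limited"
            raise RuntimeError(f"rate limit exceeded for {caller}")
        recent.append(now)
        self._calls[caller] = recent

        try:
            result = self.tools[tool](**params)
        except Exception as exc:
            entry["outcome"] = f"error: {exc}"
            raise
        entry["outcome"] = "ok"
        entry["result"] = result
        return result
```

Injecting the clock keeps the rate limiter testable, and appending the audit entry before any check means refused calls leave a trace as well as successful ones.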

Risk and Governance

Evaluation and Feedback Loop

Use goal-oriented metrics — correctness, efficiency, cost, and explainability — to drive continuous improvement and align with privacy and governance.

TruthfulQA shows that models can reproduce plausible-sounding falsehoods, underscoring the need for factual evaluation (Lin et al., 2021). Clinical contexts demand particular caution, as viewpoints in JAMA and NEJM AI emphasize; risk, ethics, and human oversight are not optional.

Design metrics across four classes: task completion rate, factual correctness, side-effect cost, and latency/throughput. Build loops with self-reflection, external review, and A/B testing. Logs, versioning, and permissions enable audit and accountability.

Without evaluation, you don’t have a system — you have a one-off script.

Example Metrics
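As a rough illustration, the four metric classes named above can be computed from per-run records. The `RunRecord` fields are hypothetical, dollar cost is used as a simple proxy for side-effect cost, and the 95th-percentile latency uses the nearest-rank method; a production system would likely pull these from its tracing or logging store.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    completed: bool       # did the task reach its success criterion?
    claims_checked: int   # factual claims that were reviewed
    claims_correct: int   # reviewed claims judged correct
    cost_usd: float       # assumption: proxy for side-effect cost
    latency_s: float


def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate completion, correctness, cost, and latency over a batch."""
    if not runs:
        raise ValueError("need at least one run")
    total_claims = sum(r.claims_checked for r in runs)
    latencies = sorted(r.latency_s for r in runs)
    # Nearest-rank percentile: smallest value covering 95% of runs.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / len(runs),
        "factual_correctness": (
            sum(r.claims_correct for r in runs) / total_claims
            if total_claims else None
        ),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "p95_latency_s": latencies[p95_index],
    }
```

Reporting correctness as `None` when no claims were checked avoids presenting an unreviewed batch as fully correct.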

Loop Mechanics

Architecture at a Glance

The diagram below sketches the main path and feedback loop from external signals to evaluation.

flowchart LR
    A[External signals/data] --> B[Perception & parsing]
    B --> C[RAG retrieval]
    C --> D[Context assembly]
    D --> E[Planner]
    E --> F[Tool executors]
    F --> G[Evaluation & feedback]
    G --> C

Practice Checklist

Anti-Patterns and Risk

Cost, SLOs, and Scaling

Case Study: Weekly Report Assistant

Scenario

Flow

  1. Perception: collect weekly events, PR merges, meeting summaries.
  2. Memory: write into working memory; vectorize historical milestones for long-term storage.
  3. Planning: “extract highlights → produce a structured draft → validate JSON → request human confirmation”.
  4. Tools: call report templates and knowledge-base APIs; validate parameters and enforce rate limits.
  5. Evaluation: JSON schema checks, link integrity, keyword hits; track accuracy and time-to-complete.
  6. Feedback: review failures; update RAG re-ranking and summarization strategies.
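The "validate JSON" and schema-check steps in the flow above might look like the following. This is a sketch under assumptions: the report fields (`week`, `highlights`, `title`, `link`) are invented for illustration, validation is hand-rolled with the standard `json` module rather than a schema library, and the link check only inspects the URL scheme without making network requests.

```python
import json


def validate_report(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the draft may go
    forward to human confirmation."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["top level must be an object"]

    errors: list[str] = []
    if not isinstance(report.get("week"), str):
        errors.append("'week' must be a string")

    highlights = report.get("highlights")
    if not isinstance(highlights, list) or not highlights:
        errors.append("'highlights' must be a non-empty list")
        return errors

    for i, item in enumerate(highlights):
        if not isinstance(item, dict):
            errors.append(f"highlights[{i}] must be an object")
            continue
        if not isinstance(item.get("title"), str) or not item["title"].strip():
            errors.append(f"highlights[{i}].title must be a non-empty string")
        link = item.get("link")
        # Shape check only; real link integrity needs an HTTP request.
        if not isinstance(link, str) or not link.startswith(("http://", "https://")):
            errors.append(f"highlights[{i}].link must start with http:// or https://")
    return errors
```

Returning every problem at once, rather than failing on the first, gives the feedback step a complete picture of what the summarization stage got wrong.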

Outcome and Iteration

Challenges and Ethical Considerations

Conclusion: Trading Engineering Controllability for Agent Sustainability

References and Further Reading

Suggested sources (draft): technical blogs from DeepMind/Google Research and OpenAI/Anthropic; Stanford HAI and MIT Technology Review; academic journals and venues (Nature, Science, NeurIPS, ICLR).
