Engineering | Apr 2026 | 12 min read

Agent workflow patterns for enterprise engineering teams

Architecture patterns for deploying AI agents in production: tool use, human-in-the-loop supervision, multi-agent orchestration, and observability. Practical guidance for engineering leads.

From chat to workflows

Most teams' first experience with AI agents is conversational: a developer asks a question and the model responds. The real productivity gains come when agents execute structured workflows: reviewing PRs, generating tests, triaging issues, updating documentation, or orchestrating multi-step deployments.

The difference between a chatbot and a workflow agent is determinism. A chatbot can give a different answer each time. A workflow agent needs to produce reliable, auditable results within defined boundaries. That requires architecture, not just a model and a prompt.

This guide covers the four patterns we use when deploying agent workflows for engineering teams, along with the observability and governance infrastructure that makes them production-ready.

Pattern 1: Tool-use agents

The simplest agentic pattern. The model receives a task and a set of tools it can call. It decides which tools to use, in what order, and synthesises the results.

python
tools = [
    {
        "name": "read_file",
        "description": "Read contents of a file in the repository",
        "parameters": {
            "path": {"type": "string", "description": "File path relative to repo root"}
        }
    },
    {
        "name": "run_tests",
        "description": "Execute the test suite and return results",
        "parameters": {
            "test_path": {"type": "string", "description": "Path to test file or directory"}
        }
    },
    {
        "name": "write_file",
        "description": "Write content to a file",
        "parameters": {
            "path": {"type": "string"},
            "content": {"type": "string"}
        }
    }
]

When to use: Tasks where the model needs to gather information before acting. Code review (read files, check test results, write review comments). Documentation generation (read source, understand API, generate docs).
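
The loop that drives a tool-use agent can be sketched in a few lines. This is a minimal illustration, not a specific framework's API: `run_agent`, `model_step`, and the shape of the action dictionaries are all assumptions made for the example. In practice `model_step` would wrap a call to your model provider.

```python
from typing import Any, Callable


def run_agent(
    task: str,
    model_step: Callable[[list], dict],
    tool_impls: dict[str, Callable[..., Any]],
    max_steps: int = 10,
) -> str:
    """Drive a tool-use loop until the model gives a final answer or steps run out."""
    history = [{"role": "task", "content": task}]
    for _ in range(max_steps):
        # Hypothetical contract: the model returns either
        # {"tool": name, "args": {...}} or {"final": answer}.
        action = model_step(history)
        if "final" in action:
            return action["final"]

        impl = tool_impls.get(action["tool"])
        if impl is None:
            observation = f"Unknown tool: {action['tool']}"
        else:
            try:
                observation = impl(**action["args"])
            except Exception as exc:
                # Give the model a short message it can reason about.
                observation = f"Tool {action['tool']} failed: {exc}"
        history.append(
            {"role": "tool", "tool": action["tool"], "content": observation}
        )
    return "Stopped: step limit reached."
```

The step limit is a deliberate choice: it bounds cost and stops a confused agent from looping indefinitely.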

Key design decisions:

  • Tool granularity: Too few tools and the agent can't do useful work. Too many and it struggles to choose the right one. We typically start with 5-10 tools per workflow and refine based on usage patterns.

  • Tool sandboxing: Every tool should validate inputs, enforce access controls, and log invocations. A write_file tool that accepts arbitrary paths is a security risk. Restrict paths, validate content, and log everything.

  • Error handling: Tools fail. Files don't exist, tests time out, APIs return errors. The agent needs clear error messages it can reason about, not stack traces.
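
The sandboxing and error-handling points above might look like this for a write tool. It is a sketch under stated assumptions: `safe_write_file`, `ALLOWED_DIRS`, and `MAX_CONTENT_BYTES` are hypothetical names and values, and the allowlist should reflect your own repository layout.

```python
import logging
from pathlib import Path

logger = logging.getLogger("agent.tools")

# Hypothetical sandbox settings; adjust to your repository.
ALLOWED_DIRS = ("docs", "tests")
MAX_CONTENT_BYTES = 100_000


def safe_write_file(repo_root: str, path: str, content: str) -> dict:
    """Validate a write request and return a result the agent can reason about."""
    logger.info("write_file requested: path=%s bytes=%d", path, len(content))
    root = Path(repo_root).resolve()
    target = (root / path).resolve()

    # Reject paths that escape the repository, such as "../secrets".
    if not target.is_relative_to(root):
        return {"ok": False, "error": f"Path '{path}' is outside the repository."}

    # Reject paths outside the allowlisted top-level directories.
    rel = target.relative_to(root)
    if not rel.parts or rel.parts[0] not in ALLOWED_DIRS:
        allowed = ", ".join(ALLOWED_DIRS)
        return {"ok": False, "error": f"Writes are only allowed under: {allowed}."}

    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        return {"ok": False, "error": "Content exceeds the size limit."}

    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    except OSError as exc:
        # Return a short message rather than a stack trace.
        return {"ok": False, "error": f"Could not write '{path}': {exc.strerror}"}

    return {"ok": True, "path": rel.as_posix()}
```

Every failure comes back as a plain `error` string, so the agent can decide whether to retry with a different path or report the problem.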

Pattern 2: Human-in-the-loop workflows

The agent executes a workflow but pauses at defined checkpoints for human review and approval before continuing.

python
class ReviewWorkflow:
    def __init__(self, agent, reviewer):
        self.agent = agent
        self.reviewer = reviewer

    async def run(self, task):
        # Step 1: Agent analyses (no approval needed)
        analysis = await self.agent.analyse(task)

        # Step 2: Agent proposes changes (approval required)
        proposal = await self.agent.propose_changes(analysis)
        approved = await self.reviewer.review(proposal)

        if not approved:
            return {"status": "rejected", "proposal": proposal}

        # Step 3: Agent applies approved changes
        result = await self.agent.apply_changes(proposal)
        return {"status": "applied", "result": result}

When to use: Any workflow where the agent modifies production code, interacts with external systems, or produces outputs that will be customer-facing. The human checkpoint reduces risk without eliminating the productivity benefit.

Checkpoint placement: The key design question is where to place approval gates. Too many checkpoints and the workflow becomes slower than doing it manually. Too few and you lose the safety benefit.

Our rule of thumb: require approval before any write action (file modifications, API calls, deployments) but allow read actions (code analysis, test execution, log inspection) to proceed autonomously. This preserves most of the speed benefit while maintaining oversight over state changes.
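
One way to encode this rule of thumb is a small gate that classifies each action before it runs. The names below (`READ_ACTIONS`, `requires_approval`, `execute_with_gate`) are illustrative assumptions, and treating any unrecognised action as a write is our choice for the sketch, not something the rule itself dictates.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical set of actions considered read-only.
READ_ACTIONS = {"read_file", "run_tests", "read_logs"}


def requires_approval(action: str) -> bool:
    """Anything not explicitly known to be read-only needs a human decision."""
    return action not in READ_ACTIONS


async def execute_with_gate(
    action: str,
    run: Callable[[], Awaitable[Any]],
    approve: Callable[[str], Awaitable[bool]],
) -> dict:
    """Run read actions directly; pause write actions for approval first."""
    if requires_approval(action) and not await approve(action):
        return {"status": "rejected", "action": action}
    return {"status": "done", "action": action, "result": await run()}
```

Failing closed means a newly added tool is gated until someone deliberately marks it as read-only.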

Pattern 3: Autonomous task agents

The agent receives a well-defined task and executes it end-to-end without human intervention. Results are reviewed after completion rather than during execution.

When to use: High-volume, lower-risk tasks where the cost of review during execution outweighs the risk. Test generation, documentation updates, dependency upgrades, lint fixes.

Critical requirements for autonomous agents:

  • Bounded scope: The agent must have a clearly defined task boundary. "Write tests for this file" is bounded. "Improve the codebase" is not.

  • Reversibility: Every action the agent takes must be reversible. Work on branches, not main. Use atomic commits. Provide rollback mechanisms.

  • Validation gates: Automated checks that run after the agent completes. Tests must pass. Linting must pass. Coverage must not decrease. If validation fails, the agent's changes are rejected automatically.

python
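# git, agent, write_test_file, run_test_suite, create_pull_request and
# log_failure stand in for project-specific helpers.
async def autonomous_test_generation(file_path: str):
    branch = f"agent/tests-{file_path.replace('/', '-')}"
    git.checkout_new_branch(branch)

    try:
        tests = await agent.generate_tests(file_path)
        write_test_file(tests)

        # Validation gate: tests pass and coverage does not decrease
        result = run_test_suite()
        if result.all_passed and result.coverage_delta >= 0:
            git.commit(f"Add AI-generated tests for {file_path}")
            create_pull_request(branch)
        else:
            git.discard_changes()
            log_failure(file_path, result)
    except Exception:
        # Don't carry uncommitted changes back to main
        git.discard_changes()
        raise
    finally:
        git.checkout("main")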
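
The bounded-scope requirement can also be enforced mechanically, by checking the files an agent touched against an allowlist before committing anything. A minimal sketch, assuming a hypothetical `out_of_scope` helper and glob-style patterns:

```python
import posixpath
from fnmatch import fnmatchcase


def out_of_scope(changed_paths: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return the changed paths that match none of the allowed patterns."""
    violations = []
    for path in changed_paths:
        # Normalise first so "tests/../src/app.py" cannot slip through.
        normalised = posixpath.normpath(path)
        if not any(fnmatchcase(normalised, pat) for pat in allowed_patterns):
            violations.append(path)
    return violations
```

For a test-generation task, the allowlist might be `["tests/*"]`; any non-empty result would be treated as a validation failure and the changes discarded.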

Pattern 4: Multi-agent orchestration

Multiple specialised agents work on different aspects of a task, coordinated by an orchestrator that routes work and aggregates results.

Example: Automated PR review pipeline

Flow: The orchestrator distributes work to three specialist agents (code review, test coverage, and security scanning) that run in parallel. Each returns findings independently, and the orchestrator merges them into a single aggregated review.

Each agent is a specialist: the code review agent checks logic and patterns, the test coverage agent verifies test quality, and the security scanner checks for vulnerabilities and secrets.

When to use: Complex tasks that benefit from specialisation. A single generalist agent trying to do code review, security scanning, and test analysis in one pass will often produce worse results than three specialists working in parallel.

Design considerations:

  • Agent communication: Agents shouldn't talk to each other directly. The orchestrator manages all data flow. This keeps the system debuggable and prevents cascading failures.

  • Conflict resolution: When agents disagree (code review agent says the implementation is fine, security agent flags a concern), the orchestrator applies priority rules. Security findings always override style suggestions.

  • Timeout management: Each agent gets a time budget. If an agent hangs or takes too long, the orchestrator proceeds with results from the agents that completed. Partial results are more useful than no results.
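
These considerations could be combined in an orchestrator along the following lines. It is a sketch, not a reference implementation: the agents are assumed to be async callables that return a list of finding dictionaries, and `PRIORITY`, `run_with_budget`, and `orchestrate` are hypothetical names. Ordering findings by source is a simple stand-in for fuller priority rules.

```python
import asyncio

# Hypothetical priority order: a lower number ranks first in the merged review.
PRIORITY = {"security": 0, "code_review": 1, "test_coverage": 2}


async def run_with_budget(name, agent, pr, timeout_s):
    """Run one specialist; return (name, None) if it times out or fails."""
    try:
        return name, await asyncio.wait_for(agent(pr), timeout=timeout_s)
    except Exception:
        # Covers timeouts and agent errors; one failure does not sink the rest.
        return name, None


async def orchestrate(pr, agents, timeout_s=60.0):
    """Fan out to all agents in parallel, then merge whatever came back."""
    results = await asyncio.gather(
        *(run_with_budget(name, agent, pr, timeout_s) for name, agent in agents.items())
    )
    skipped = [name for name, found in results if found is None]
    findings = [
        {**finding, "source": name}
        for name, found in results
        if found is not None
        for finding in found
    ]
    # Stable sort: security findings first, original order kept within a source.
    findings.sort(key=lambda f: PRIORITY.get(f["source"], len(PRIORITY)))
    return {"findings": findings, "skipped": skipped}
```

Agents never see each other's output here; all data flows through `orchestrate`, and the `skipped` list makes partial results explicit to whoever reads the review.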

Observability for agent workflows

Production agent workflows need the same observability as any other production system, plus additional instrumentation specific to AI behaviour.

What to log

  • Every tool invocation: Tool name, input parameters, output, latency, success/failure. This is your audit trail.

  • Token usage per step: Track cost at the workflow level, not just the API call level. A workflow that makes 15 model calls to complete a task has different cost characteristics than one that makes 3.

  • Decision points: When the agent chooses between tools or decides to retry, log the reasoning. This is essential for debugging unexpected behaviour.

  • Human intervention events: When a human overrides, approves, or rejects an agent action, log the decision and rationale.
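
To track token usage at the workflow level, per-step counts can be accumulated and rolled up into one structured record. The sketch below assumes a hypothetical `WorkflowUsage` class; the per-1,000-token prices are parameters you would supply from your provider's pricing, not real figures.

```python
import json
from dataclasses import dataclass, field


@dataclass
class WorkflowUsage:
    """Accumulates per-step token counts for a single workflow run."""

    workflow_id: str
    steps: list = field(default_factory=list)

    def record_step(self, step: str, input_tokens: int, output_tokens: int) -> None:
        self.steps.append(
            {"step": step, "input_tokens": input_tokens, "output_tokens": output_tokens}
        )

    def totals(self, usd_per_1k_input: float, usd_per_1k_output: float) -> dict:
        """Roll up all steps into workflow-level token and cost figures."""
        input_total = sum(s["input_tokens"] for s in self.steps)
        output_total = sum(s["output_tokens"] for s in self.steps)
        cost = (
            input_total / 1000 * usd_per_1k_input
            + output_total / 1000 * usd_per_1k_output
        )
        return {
            "workflow_id": self.workflow_id,
            "model_calls": len(self.steps),
            "input_tokens": input_total,
            "output_tokens": output_total,
            "cost_usd": cost,
        }

    def to_log_line(self, usd_per_1k_input: float, usd_per_1k_output: float) -> str:
        """Serialise totals and the per-step breakdown as one JSON log line."""
        record = self.totals(usd_per_1k_input, usd_per_1k_output)
        record["steps"] = self.steps
        return json.dumps(record)
```

Keeping the per-step breakdown in the same record makes it easier to see which step drove the cost of a 15-call workflow.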

What to alert on

  • Unusual tool usage patterns: If the write_file tool suddenly gets called 10x more than normal, something may be wrong.

  • Cost spikes: A workflow that usually costs $0.50 suddenly costing $15 indicates a loop or unexpected complexity.

  • Validation failure rates: If autonomous agents start failing validation gates more frequently, model behaviour may have changed (provider update, prompt drift) or the codebase has evolved in ways the agent handles poorly.

python
class WorkflowObserver:
    def on_tool_call(self, tool_name, params, result, latency_ms):
        self.metrics.tool_calls.labels(tool=tool_name).inc()
        self.metrics.tool_latency.labels(tool=tool_name).observe(latency_ms)
        self.audit_log.record(
            event="tool_call",
            tool=tool_name,
            params=params,
            result_summary=summarise(result),
            latency_ms=latency_ms,
        )

    def on_workflow_complete(self, workflow_id, status, total_tokens, cost):
        self.metrics.workflow_duration.observe(self.elapsed(workflow_id))
        self.metrics.workflow_cost.observe(cost)
        if cost > self.cost_threshold:
            self.alert(f"Workflow {workflow_id} cost ${cost:.2f} (threshold: ${self.cost_threshold:.2f})")

Governance integration

Agent workflows need governance that's proportional to their autonomy and risk surface.

Read-only workflows (code analysis, documentation review): Minimal governance. Log interactions, review outputs periodically, monitor for data exposure.

Write workflows with human approval (PR generation, code modification): Moderate governance. Audit logging, access controls on what the agent can modify, approval workflows for production-touching changes.

Autonomous write workflows (automated test generation, lint fixes): Strict governance. Bounded scope, automated validation gates, branch-only operation, mandatory PR review of agent outputs, cost and volume limits.

The goal is governance that scales with risk, not governance that applies maximum restriction to every workflow regardless of impact.
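
The three tiers can be written down as data, so that a workflow's configuration can be checked against the controls its tier requires. This is a sketch: the `Tier` enum, the control names, and `missing_controls` are labels invented for the example, mapped from the tiers described above.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read_only"
    WRITE_WITH_APPROVAL = "write_with_approval"
    AUTONOMOUS_WRITE = "autonomous_write"


# Control names are illustrative labels for the requirements of each tier.
REQUIRED_CONTROLS = {
    Tier.READ_ONLY: {
        "interaction_logging",
        "periodic_output_review",
        "data_exposure_monitoring",
    },
    Tier.WRITE_WITH_APPROVAL: {
        "audit_logging",
        "access_controls",
        "approval_workflow",
    },
    Tier.AUTONOMOUS_WRITE: {
        "bounded_scope",
        "validation_gates",
        "branch_only",
        "mandatory_pr_review",
        "cost_and_volume_limits",
    },
}


def missing_controls(tier: Tier, enabled: list[str]) -> list[str]:
    """Return the required controls for this tier that are not yet enabled."""
    return sorted(REQUIRED_CONTROLS[tier] - set(enabled))
```

A deployment check could refuse to enable a workflow while `missing_controls` returns anything for its tier.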

Getting started

If you're introducing agent workflows to your engineering team, start with Pattern 1 (tool-use) or Pattern 2 (human-in-the-loop) applied to a single, well-defined use case. Test generation and automated PR review are the two highest-leverage starting points we've seen.

Invest in observability from day one, not after something goes wrong. The logging you build for your first workflow becomes the foundation for every workflow that follows.

If you're planning agent workflow deployments and want help with architecture, governance, and rollout strategy, book a diagnostic. We'll review your engineering workflows and identify the patterns that fit your team's risk tolerance and capability.
