Most teams start by giving every engineer a coding assistant and hoping velocity goes up. A quarter later the picture is mixed: a few engineers are visibly faster, most are about the same, the bill is real, and nobody can say whether shipping software got easier. The reason is usually simple. The tool changed; the workflow did not.
An agentic SDLC is not a faster way to type code. It is a software development process where agents take on complete chunks of work: draft a plan, implement a change, write the tests, open the PR, triage the failing build, or propose the fix. Engineers still specify intent, set the rules, and make the judgment calls.
This article is the sequence we use to move a team from manual coding to agent-assisted and agent-led work across the five places the work actually lives: planning, implementation, quality, DevOps, and issue resolution. You do not flip a switch. You change one stage at a time, and you only advance once the checks around the previous stage are strong enough to catch bad output. Skip that and you get more code, more bugs, and more review load than you started with.
First, be honest about where the manual work is
Before changing anything, map where engineering time actually goes. Not where you think it goes: where the calendar and the PR history say it goes. Most teams find that pure code authoring is a smaller slice than expected, and that the real time sinks are the things around it: turning a vague ticket into an executable plan, writing tests nobody wanted to write, chasing a flaky pipeline, and reconstructing context during a production incident.
Spend a week measuring five things: time from ticket to first PR, share of PR review time spent on mechanical issues, test coverage on changed lines, mean time to green on CI, and mean time to resolve a production issue. Collect them from individual tickets and PRs, but read them at team level. These five numbers are your baseline, and they map closely to the five stages below. Use them to decide where to start: the stage with the most pain and the least operational risk.
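As a rough illustration, two of the five baseline numbers can be computed from exported timestamps. This is a minimal sketch, not a prescribed tool: the `Change` record and its field names are hypothetical stand-ins for whatever your tracker and Git host export.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Change:
    """One ticket's journey. Field names are hypothetical."""
    ticket_opened: datetime
    first_pr_opened: datetime
    ci_first_red: Optional[datetime] = None   # None if CI never failed
    ci_back_green: Optional[datetime] = None  # None if CI never failed


def hours(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def team_baseline(changes: list[Change]) -> dict[str, float]:
    """Mean ticket-to-first-PR and mean time-to-green, in hours."""
    to_pr = [hours(c.ticket_opened, c.first_pr_opened) for c in changes]
    to_green = [
        hours(c.ci_first_red, c.ci_back_green)
        for c in changes
        if c.ci_first_red and c.ci_back_green
    ]
    return {
        "ticket_to_first_pr_h": mean(to_pr),
        # Report 0.0 when no change ever broke the build.
        "time_to_green_h": mean(to_green) if to_green else 0.0,
    }
```

Run it once over the baseline week and again after each stage change, so the comparison uses the same definition both times.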
A maturity model for the agentic SDLC
It helps to name the levels, because "we use AI" can mean four very different things:
| Level | What it looks like | Who is in the loop |
|---|---|---|
| Manual | Engineer writes everything; AI is autocomplete at most | Human does the work, reviews the work |
| Assisted | Engineer drives; agent does sub-tasks on request (a function, a test, a refactor) | Human drives every step |
| Delegated | Engineer specifies a unit of work; agent executes it end to end; human reviews the result | Human specifies and reviews; agent executes |
| Orchestrated | Engineer runs several agents in parallel across tasks; reviews, integrates, and steps in on exceptions | Human orchestrates and approves; agents execute and self-correct |
The goal is not to make every stage "orchestrated" as fast as possible. Move one stage up at a time, and expect different stages to mature at different speeds. A team can delegate implementation while planning is still assisted and incident response is still fully manual. That is normal. Forcing uniformity is how rollouts fail.
Stage 0: make the codebase legible to agents
Every later stage depends on this one, and it is the step teams skip. An agent is only as good as the context it can load. A codebase with no written conventions, no architecture notes, and tribal knowledge that lives in three senior engineers' heads will produce mediocre output no matter which tool you use.
The work here is plain but valuable:
- Write the context files agents read. A root instructions file (a CLAUDE.md or equivalent) that states the stack, the architectural rules, the testing conventions, the commands to build and test, and the things that look wrong but are deliberate. Put one in each major package, not just the root.
- Make conventions executable. A convention that lives in a wiki is invisible to an agent. The same convention encoded as a linter rule, a formatter config, a type, or a test is enforced on every run. Move as many "we always do it this way" rules as you can from prose into checks.
- Stabilise the commands. Agents need a deterministic way to build, run, test, and lint. If the only person who can get the test suite running locally is the engineer who wrote it, fix that first. A one-command bootstrap pays for itself immediately.
- Capture the domain. A short glossary of domain terms and a one-page map of the major services and how they call each other removes the most common cause of wrong agent output: the system uses nouns the agent cannot infer from code alone.
This stage has no flashy demo, which is exactly why it gets skipped. It is also why the teams that do it pull ahead: agents finally get the same basic map of the system that engineers already carry around.
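To make "conventions as checks" concrete, here is a minimal sketch of one convention turned into code. The convention itself (SQL only under a `db` package), the directory name, and the regular expression are assumptions for illustration. The pattern is a crude heuristic and will produce false positives on prose that happens to read like SQL.

```python
import re
from pathlib import Path

# Hypothetical convention: raw SQL may only appear under an allow-listed package.
RAW_SQL = re.compile(
    r"\b(SELECT|INSERT|UPDATE|DELETE)\b.+\b(FROM|INTO|SET)\b", re.IGNORECASE
)
ALLOWED_DIRS = ("db",)


def find_raw_sql(root: Path) -> list[str]:
    """Return 'path:line' for each line that looks like SQL outside ALLOWED_DIRS."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        if rel.parts[0] in ALLOWED_DIRS:
            continue  # SQL is expected here
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if RAW_SQL.search(line):
                hits.append(f"{rel.as_posix()}:{number}")
    return hits
```

Wired into the test suite as an assertion that `find_raw_sql(...)` returns an empty list, the rule runs on every agent iteration instead of waiting for a reviewer to remember it.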
Stage 1: fix planning
Manual planning produces tickets written for humans: a sentence of intent, some context the author assumed everyone shared, and an implicit plan that lives in the assignee's head. Agents cannot execute that reliably. The first real shift is that planning produces specifications an agent can act on, and the agent helps produce them.
The change is not "write longer tickets". It is a two-step loop. First, a human states intent and constraints. Then an agent expands that into a concrete plan: the files it expects to touch, the approach, the edge cases, the test strategy, and the open questions. The human reviews and corrects that plan before any code is written. Teams that do this merge more of what agents produce, because they catch bad assumptions before they become diffs.
In practice the planning note looks like this, and it lives in the repo or the ticket, not in a chat window that vanishes:
## Intent
Let admins bulk-export the audit log as CSV, respecting the
existing tenant isolation rules.
## Constraints
- Must reuse AuditQuery; do not write raw SQL.
- Export runs async; no request may block > 30s.
- Tenant scoping is non-negotiable and must be tested.
## Plan (agent-proposed, human-reviewed)
1. Add AuditExportJob behind the existing job queue.
2. Stream rows via AuditQuery.paginate to avoid loading all in memory.
3. New endpoint POST /admin/audit/export returns a job id.
4. Tests: tenant isolation, large export (>100k rows), auth.
## Open questions
- Retention of generated CSVs? (need product decision)
Two things make this work at a team level. First, the plan is reviewed before implementation, which is far cheaper than reviewing a finished diff: correcting an approach costs a comment; correcting a diff costs a rewrite. Second, the open questions surface the human decisions (product trade-offs, security calls) early, instead of letting the agent guess halfway through. Planning is the stage where you most want the human firmly in the loop, even once implementation is delegated.
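A planning note with this shape can be checked mechanically before any work is delegated. The sketch below assumes the four section headings used in the example note and treats an empty section as missing. Both choices are illustrative, and a team with nothing open might write "None" explicitly under the last heading.

```python
import re

REQUIRED_SECTIONS = ("Intent", "Constraints", "Plan", "Open questions")


def missing_sections(note: str) -> list[str]:
    """Return the required sections that are absent or empty in a planning note."""
    bodies: dict[str, str] = {}
    current = None
    for line in note.splitlines():
        heading = re.match(r"##\s+(.*)", line)
        if heading:
            current = heading.group(1).strip()
            bodies[current] = ""
        elif current is not None:
            bodies[current] += line.strip()

    missing = []
    for name in REQUIRED_SECTIONS:
        # Prefix match, so "Plan (agent-proposed, human-reviewed)" counts as "Plan".
        matches = [
            body for title, body in bodies.items()
            if title.lower().startswith(name.lower())
        ]
        if not any(matches):
            missing.append(name)
    return missing
```

A non-empty result is a reason to send the note back to its author rather than on to the agent.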
Bring product and design into the lifecycle
The mistake most teams make is treating the agentic SDLC as an engineering-only programme. But the spec an agent executes begins as a product decision and a design, not an engineering ticket. If product and design stay outside the loop, agents build the wrong thing faster. The "intent" and "open questions" in the planning note above are exactly where product managers and designers belong.
On the product side, intent has to be more than a sentence in a tracker. Product managers who write a crisp statement of the user problem, constraints, and success criteria give the planning agent something it can expand into a sound plan. Vague intent produces plausible but wrong plans. Two habits pay off quickly:
- Product writes intent and acceptance criteria, the agent drafts the plan, and they review it together. This catches "that is not what we meant" before a line of code exists. The product owner is reviewing an approach, not a backlog of finished features built on a misread.
- Prototyping moves left. Agents make it cheap to spin up a working prototype of a flow before committing engineering time to it. Product and design can validate the real thing with users in days, and the throwaway prototype becomes a precise spec for the delegated build.
On the design and UX side, the connection is direct because design tools already contain structured data. Design tokens, component specs, and layout rules in a tool like Figma can drive generated UI, with the design source as the contract and a visual-regression check catching drift. Chronic design drift becomes easier to catch because keeping code in sync with the design is a reviewable task, not manual pixel-peeping.
What stays human is the judgment product and design exist to provide: taste, trade-offs between user needs, accessibility and interaction decisions, and whether the built thing actually feels right. The agent removes translation work between intent, design, and code; it does not decide what is worth building or what good looks like. Designers and PMs shift from producing artefacts by hand to specifying intent and reviewing output critically.
Stage 2: delegate implementation
This is the stage everyone starts with, and the one that goes wrong when Stages 0 and 1 were skipped. With a legible codebase and a reviewed plan in hand, implementation moves from assisted (engineer writes, agent suggests) to delegated (engineer hands the agent a planned unit of work and reviews the result).
The mechanics matter. Delegated implementation works when the unit of work is the right size: small enough to review in one sitting, large enough to be worth delegating. A good rule is one reviewable PR per delegated task, scoped so a human can hold the whole change in their head. The agent implements against the plan, runs the tests and linters, iterates until they pass, and only then surfaces the change for human review. The engineer's job shifts from writing the code to specifying it well and reviewing it critically.
The failure mode to design against is the debug loop: ask the agent, run, fail, ask again, run, fail, on repeat. It looks like productive work and burns disproportionate cost and time at a low merge rate. When an agent has failed the same check two or three times, the right move is to stop, return to the plan, and re-specify rather than keep prompting. Teach the team to read a stuck agent as a planning signal, not a prompting problem.
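One way to make that stop rule mechanical is a failure budget around the agent loop. In this sketch, `attempt` is a hypothetical callable that asks the agent for a change, runs the checks, and returns `None` on success or the name of the failing check. The threshold of three is an assumption taken from the "two or three times" rule of thumb.

```python
from collections import Counter
from typing import Callable, Optional

MAX_SAME_FAILURE = 3  # assumption: upper end of "two or three times"


def run_with_budget(
    attempt: Callable[[], Optional[str]],
    max_same_failure: int = MAX_SAME_FAILURE,
    max_attempts: int = 10,
) -> str:
    """Call attempt() until it passes or one check has failed too often."""
    failures: Counter = Counter()
    for _ in range(max_attempts):
        failed_check = attempt()
        if failed_check is None:
            return "passed"
        failures[failed_check] += 1
        if failures[failed_check] >= max_same_failure:
            # Same check keeps failing: hand back to a human to re-specify.
            return f"replan: {failed_check} failed {failures[failed_check]} times"
    return "replan: attempt budget exhausted"
```

The useful part is the return value: a "replan" result routes the task back to the planning note instead of into another prompt.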
At this stage you should already see the baseline numbers move: time from ticket to first PR drops, and the share of changes that arrive with passing tests and a clear description goes up. If they do not, the problem is upstream, in Stage 0 or Stage 1, not in the coding tool.
Stage 3: harden quality
Quality is the stage that decides whether the whole programme is safe. Once agents write a meaningful share of your code, manual review becomes the bottleneck and can create a false sense of security. You cannot eyeball your way to confidence when the volume of change doubles. Quality has to become a set of automated gates that catch routine failures before a human reads the diff.
Three layers do most of the work:
- Tests as the contract. Agents are exceptionally good at generating tests from a specification, and tests are how you make agent output verifiable. Require that every delegated change ships with tests, generated by the agent and reviewed by the human, and bias the suite toward the behaviours that matter rather than coverage for its own sake. The plan from Stage 1 already named the test strategy; this is where it is enforced.
- Automated code review in the pipeline. An agent reviewer running on every PR catches issues such as missing error handling, ownership checks, convention violations, and performance footguns before they eat senior review time. It does not replace human review; it removes the pattern-matching work so humans spend their attention on design and trade-offs.
- Golden datasets for anything probabilistic. If your product itself uses models, deterministic tests are not enough. A small, curated evaluation set, run on every change, is what tells you whether a prompt or model change made the product better or worse. We wrote the ten rules we use to build one; the short version is that it should be sampled from real traffic, sized to be reviewed, and owned by one person.
The principle that ties these together is that humans design and audit the gates while the gates do the checking. Senior engineers spend their time on the checks (the tests, the review prompts, the eval rubrics) and handling the judgment calls those checks escalate. They do not need to read every diff by hand. A team that gets this right can delegate more implementation work without the review queue becoming the new bottleneck.
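For the golden-dataset layer, the gate can be as small as a pass rate compared against the last accepted baseline. The sketch below is illustrative only: the three cases are invented, exact-match scoring is the simplest possible rubric, and `classify` stands in for whatever model-backed function your product exposes.

```python
from typing import Callable

# Hypothetical golden cases; in practice these are sampled from real traffic.
GOLDEN = [
    {"input": "reset my password", "expected": "account"},
    {"input": "invoice is wrong", "expected": "billing"},
    {"input": "app crashes on login", "expected": "bug"},
]


def pass_rate(classify: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of golden cases where the system's output matches exactly."""
    passed = sum(1 for case in cases if classify(case["input"]) == case["expected"])
    return passed / len(cases)


def gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Allow the change only if it does not regress below the baseline."""
    return candidate >= baseline - tolerance
```

Humans own the cases and the tolerance; the pipeline runs `gate` on every prompt or model change.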
Stage 4: speed up delivery
Once code is being produced and verified at a higher rate, delivery becomes the constraint. A pipeline tuned for manual coding will choke on the throughput of an agent-assisted team. A flaky or slow pipeline also poisons every stage upstream of it, because agents and engineers waste time waiting on, and re-running, an unreliable signal.
The work here runs in two directions. First, make the pipeline itself fast and trustworthy enough to be the source of truth that agents act on: deterministic, well under the time budget where people start context-switching away, and free of flaky tests that train everyone to ignore red. An agent that cannot trust the build signal cannot self-correct against it.
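Flaky tests can often be found from CI history alone. This sketch assumes you can export `(commit_sha, test_name, passed)` tuples from your CI system, and it uses one common heuristic: a test that both passed and failed on the same commit is flaky, because the code did not change between those runs.

```python
from collections import defaultdict


def flaky_tests(runs: list[tuple[str, str, bool]]) -> list[str]:
    """Return test names that both passed and failed on at least one commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for sha, test, passed in runs:
        outcomes[(sha, test)].add(passed)
    # Two distinct outcomes for the same (commit, test) pair means flaky.
    return sorted({test for (_, test), seen in outcomes.items() if len(seen) == 2})
```

Tests on that list are candidates for quarantine or repair before agents are asked to self-correct against the build.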
Second, point agents at the DevOps work itself, which is unusually well suited to delegation because it is config-heavy, pattern-based, and well documented:
- Pipeline and infrastructure changes. CI config, IaC, Dockerfiles, and deployment manifests are structured, convention-driven work that agents handle well, with the same plan-then-implement-then-review loop as application code.
- Agentic build triage. When CI goes red, an agent can read the failure, correlate it with the diff, and either propose a fix or explain the breakage, so the engineer arrives at a diagnosis rather than a wall of logs.
- Release notes, changelogs, and migration steps. The release paperwork is mechanical and tedious, which makes it a good fit for agents as long as a human approves the result.
The safety rule is simple: agents propose, the pipeline verifies, and a human approves anything that touches production. Delegation gets you a ready-to-merge change and a green build; promotion to a live environment stays deliberate and human-approved.
Stage 5: shorten issue resolution
Incident response is the last stage to change and the one with the highest payoff, because it is where expensive manual work happens under pressure. At 2am, the slow part of resolving an issue is rarely the fix. It is reconstructing context: what changed, what the telemetry is saying, which of the last twenty deploys is implicated, and where in the code the failing path lives. That reconstruction is exactly what an agent does well and a tired human does slowly.
A delegated resolution loop looks like this:
- Triage. On a page, an agent pulls the relevant logs, traces, and metrics, correlates them with recent deploys, and produces a first-pass hypothesis with the evidence attached, so the on-call engineer starts from a diagnosis instead of a blank dashboard.
- Root-cause correlation. The agent ties telemetry back to specific code and to the change that likely introduced the regression, holding both the runtime signal and the source in context at once.
- Proposed remediation. For well-understood failure classes, the agent drafts the fix or the rollback as a reviewable change. A human decides whether to ship it.
- Post-incident capture. The agent drafts the timeline and postmortem from the actual signals, and (most valuably) the regression test that would have caught it, which feeds straight back into Stage 3.
The boundary is the same as everywhere else, only sharper because the stakes are higher: the agent accelerates diagnosis and drafts the remediation; a human authorises anything that changes production during an incident. Done well, this is where mean-time-to-resolve drops the most.
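The simplest piece of the triage step, narrowing recent deploys to plausible suspects, can be sketched in a few lines. The six-hour look-back window and the `(deploy_id, deployed_at)` input shape are assumptions. A real triage agent would weigh telemetry and diffs as well, not just timing.

```python
from datetime import datetime, timedelta


def suspect_deploys(
    deploys: list[tuple[str, datetime]],
    incident_start: datetime,
    window: timedelta = timedelta(hours=6),
) -> list[str]:
    """Return ids of deploys inside the window before the incident, newest first."""
    in_window = [
        (deployed_at, deploy_id)
        for deploy_id, deployed_at in deploys
        if incident_start - window <= deployed_at <= incident_start
    ]
    return [deploy_id for _, deploy_id in sorted(in_window, reverse=True)]
```

The output is a starting list for the on-call engineer, not a verdict; rolling anything back remains a human decision.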
What the team actually does now
Across all five stages, the same shift recurs: engineers move from producing every artefact themselves to specifying, reviewing, and orchestrating the work. This is worth making explicit, because it is the part of the transition that unsettles people, and the part that determines whether the change sticks.
- Engineers spend less time typing implementation and more time writing precise specifications, reviewing agent output critically, and designing the checks that make output reliable. Judgment, taste, and system understanding become the scarce skills; raw typing speed stops mattering.
- Senior engineers and leads spend their time on the conventions, review gates, eval rubrics, and workflow habits that separate reliable agent use from noisy agent use. The highest-value thing a lead does in an agentic team is keep the shared context and checks healthy.
- The orchestration skill (running several agents in parallel across tasks, integrating their output, and intervening on exceptions) becomes a genuine competency, not an afterthought. It is closer to running a small team than to coding, and not everyone takes to it at the same pace.
Be candid with the team that this is a change in what the job is, not just in the tools. The engineers who thrive are the ones who were always strong on design and review; the adjustment is harder for those whose identity was tied to authoring code by hand. Make the move voluntary stage by stage, coach it, and let the baseline numbers, not mandates, make the case.
Make it visible to managers
A workflow change managers cannot see is hard to keep funded. The old proxies engineering managers leaned on (commits, lines of code, hours at the keyboard) were always weak, and with coding agents they are actively misleading: an agent can produce a thousand lines while the engineer's real contribution was the spec and the review that made those lines correct. If you change how the work happens without changing what leadership can see, you leave them to judge the new workflow with the old instruments.
The fix is to give managers a view built on the same five baseline metrics you captured at the start, tracked over time, plus a small number of agent-specific signals:
- Movement on the baseline. Ticket-to-first-PR time, mechanical share of review, coverage on changed lines, mean time to green, and mean time to resolve, trended week over week. This is the evidence that each stage actually worked, and it is the answer to "is this paying off?"
- Where agents touch the work. Which stages (planning, code, tests, review, DevOps, incidents) agents are actually involved in, and which they are not. The gaps are usually more informative than the totals, and they tell a manager exactly where the next investment goes.
- Unit economics. Cost per merged PR or per shipped feature, not a raw token bill. This turns "we spend a lot on tools" into a number leadership can argue about honestly.
- Workflow health. The share of work stuck in expensive patterns like the debug loop, so leads can coach against them and managers can see the waste shrink.
Two principles keep this honest. The unit of analysis is the team, the use case, and the stage, never the individual engineer: the moment a view like this becomes a personal productivity scorecard, people work around it and the data rots. And every metric has to connect to cost, risk, or delivery, which rules out vanity numbers. We built Atlas because this cross-tool view does not exist in any single vendor dashboard; the principle holds whether or not you use it. Give managers this visibility and budget conversations become evidence-based instead of anecdotal.
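The unit-economics signal is simple arithmetic once agent spend is attributed to PRs. This sketch assumes each PR record carries a `team`, an `agent_cost_usd`, and a `merged` flag; all three field names are hypothetical. It aggregates by team, never by engineer.

```python
from collections import defaultdict


def cost_per_merged_pr(prs: list[dict]) -> dict[str, float]:
    """Per team: total agent spend divided by the number of merged PRs.

    Spend on PRs that never merged still counts toward the total, which is
    what makes expensive dead ends such as the debug loop visible.
    """
    spend: dict[str, float] = defaultdict(float)
    merged: dict[str, int] = defaultdict(int)
    for pr in prs:
        spend[pr["team"]] += pr["agent_cost_usd"]
        merged[pr["team"]] += 1 if pr["merged"] else 0
    # Teams with no merged PRs are omitted rather than divided by zero.
    return {team: spend[team] / merged[team] for team in spend if merged[team]}
```

For example, a team that spent 4 USD on a merged PR and 6 USD on an abandoned one shows 10 USD per merged PR, not 4.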
Common failure modes
The ways this goes wrong are consistent enough to list:
- Tooling without process. Handing out assistants and declaring victory. Velocity is uneven, nobody can attribute it, and the lifecycle never actually changes. This is the default failure and the reason this article is staged the way it is.
- Skipping Stage 0. Pointing agents at an illegible codebase and blaming the tool for mediocre output. The missing context is usually the bottleneck.
- Scaling before quality. Letting agents produce a high volume of code before the gates exist, so the review queue becomes the bottleneck and bugs leak through. More code, more bugs, more review load than you started with.
- Forcing uniformity. Insisting every stage reach the same maturity level at the same time. Different stages mature at different rates; that is fine, and fighting it wastes effort.
- Removing the human from the wrong place. Delegating the judgment calls (product trade-offs, security decisions, production promotion) instead of the legwork.
Get the sequence right and the way work moves through the team changes without breaking the team: planning produces specifications, implementation runs against reviewed plans, quality gates catch routine failures, delivery keeps pace, and incidents resolve faster because context reconstruction is automated. The numbers you baselined in week one are how you prove it.
References and further reading
- The agentic workflow patterns we see work in practice - the recurring step sequences that separate high-merge, low-cost workflows from expensive ones.
- The AI adoption checklist - the operating-model groundwork that sits underneath any of this.
- Golden datasets for testing AI - how to build the evaluation sets that make Stage 3 work when your product itself uses models.
On the wider practice, these outside references informed the playbook above:
- Anthropic, Building effective agents - the distinction between assisted workflows and delegated agents that underpins the maturity model, with patterns for where to keep a human in the loop.
- Anthropic, Claude Code: best practices for agentic coding - concrete guidance on the context files, conventions, and command setup that make Stage 0 (codebase legibility) work in practice.
- Martin Fowler / Birgitta Böckeler, Exploring Generative AI - an ongoing, sceptical field report on what AI-assisted engineering actually changes day to day, and where it does not help.
- Addy Osmani, The 70% problem: hard truths about AI-assisted coding - why the last stretch of agent output needs human judgment, which is the argument for the quality gates in Stage 3.
- DORA, the four keys and DevOps research - the delivery metrics (lead time, deploy frequency, time to restore, change-fail rate) behind the baseline numbers and Stage 4.
- Forsgren et al., The SPACE of developer productivity - why single-number productivity proxies mislead, which is the basis for the manager-visibility section's team-and-outcome framing.
- GitHub, Quantifying Copilot's impact on developer productivity and happiness - early empirical data on assisted coding, useful as a baseline for what tooling alone does and does not move.
- McKinsey, Unleashing developer productivity with generative AI - a leadership-facing view of where gains concentrate across the lifecycle, and why they vary so much by task.
If you are planning the move from manual coding to agent-assisted delivery and want help sequencing it for your team, get in touch. We will look at where the manual work actually is, where the risk sits, and which stage is worth changing first.