Setting up Claude Code for automated test generation

Why AI-generated tests are a good starting point

When engineering teams ask where to start with AI coding agents, we almost always recommend test generation. It's high-leverage (tests are valuable but tedious), low-risk (generated tests are easy to review and reject), and immediately measurable (coverage numbers move).

Claude Code's custom skills feature lets you codify test generation patterns into reusable commands that your entire team can run consistently. This post walks through how we set it up, what the skill configuration looks like, and what results we've seen.

What Claude Code skills are

Skills are markdown files in your repository's .claude/ directory that define reusable instructions for Claude Code. When an engineer invokes a skill, Claude Code reads the instructions and applies them to the current context. Think of them as codified expertise: the same way you'd write a runbook for a manual process, but executable by an AI agent.

The file structure is straightforward:

.claude/
  skills/
    generate-tests/
      SKILL.md
    review-pr/
      SKILL.md
    document-api/
      SKILL.md
  settings.json

The test generation skill

Here's a simplified version of the test generation skill we deploy. The full version includes more framework-specific patterns, but this captures the core approach.

```markdown
---
name: generate-tests
description: Generate tests for a source file or function, following this repository's testing conventions.
---

# Generate Tests

When asked to generate tests for a file or function:

1. Read the source file and understand the public API
2. Identify the testing framework from package.json or existing test files
3. Generate tests covering:
   - Happy path for each public function/method
   - Edge cases: null inputs, empty arrays, boundary values
   - Error cases: invalid inputs, expected exceptions
   - Integration points: mock external dependencies

## Conventions
- Follow existing test file naming: `*.test.ts` or `*.spec.ts`
- Use the same assertion style as existing tests
- Group tests with describe blocks matching the source structure
- Each test should have a clear, descriptive name
- Do NOT test private/internal functions directly

## Quality checks
- Every test must have at least one assertion
- Tests must be independent (no shared mutable state)
- Mock external services, databases, and network calls
- Use factory functions for test data, not inline objects
```

The key insight is specificity. A generic prompt like "write tests for this file" produces generic tests. The skill codifies your team's specific conventions: naming patterns, assertion styles, what to mock, and what not to test.
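To make one of those conventions concrete, here is a minimal sketch of the "factory functions for test data" rule. The `Order` type, `makeOrder`, and `orderTotalCents` are hypothetical names invented for this illustration, not part of any real codebase or of the skill itself.

```typescript
// Hypothetical domain type, used only for illustration.
interface Order {
  id: string;
  quantity: number;
  unitPriceCents: number;
  currency: "USD" | "EUR";
}

// Factory: the one place that knows what a valid Order looks like.
// Each test overrides only the fields it cares about.
function makeOrder(overrides: Partial<Order> = {}): Order {
  return {
    id: "order-1",
    quantity: 1,
    unitPriceCents: 1000,
    currency: "USD",
    ...overrides,
  };
}

// Hypothetical function under test.
function orderTotalCents(order: Order): number {
  return order.quantity * order.unitPriceCents;
}

// A test body then states only what matters for the case at hand.
const bulkOrder = makeOrder({ quantity: 3 });
console.log(orderTotalCents(bulkOrder)); // 3000
```

When a new required field is added to `Order`, only the factory changes, whereas inline objects would have to be updated in every test that builds one.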

Configuration and access controls

Claude Code's settings file controls what the agent can and cannot do. For test generation, we configure permissions tightly:

```json
{
  "permissions": {
    "allow": [
      "Read(**/src/**)",
      "Read(**/test/**)",
      "Edit(**/test/**)",
      "Read(package.json)",
      "Read(tsconfig.json)"
    ],
    "deny": [
      "Edit(**/src/**)",
      "Bash"
    ]
  }
}
```

The agent can read source code and existing tests, and write to the test directory. It cannot modify source code or execute commands. This is intentional. We want the agent to generate test files, not change production code or run arbitrary commands.

The workflow in practice

An engineer's workflow looks like this:

1. They write or modify a source file as part of normal development.
2. They run the test generation skill against the changed file.
3. Claude Code reads the source, checks existing test patterns, and generates a test file.
4. The engineer reviews the generated tests, adjusts or removes anything that doesn't make sense, and commits.

The review step is critical. AI-generated tests are a draft, not a finished product. Engineers need to verify that the tests actually check meaningful behaviour, not just that the code runs without throwing. We explicitly train teams to look for tests that would pass even if the implementation were wrong. Those are the ones to remove.
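A small sketch of what "would pass even if the implementation were wrong" looks like in practice. The `applyDiscount` function and both checks are hypothetical, written as plain functions rather than in any particular test framework, and the bug is deliberate.

```typescript
// Hypothetical function under test, with a deliberate bug:
// it adds the discount instead of subtracting it.
function applyDiscount(price: number, percent: number): number {
  return price + (price * percent) / 100; // BUG: should subtract
}

type DiscountFn = (price: number, percent: number) => number;

// Weak check: only confirms the call returns a number without throwing.
function weakCheck(fn: DiscountFn): boolean {
  try {
    return typeof fn(100, 10) === "number";
  } catch {
    return false;
  }
}

// Meaningful check: pins the expected value for a known input.
function meaningfulCheck(fn: DiscountFn): boolean {
  return fn(100, 10) === 90;
}

console.log(weakCheck(applyDiscount)); // true, so the bug goes unnoticed
console.log(meaningfulCheck(applyDiscount)); // false, so the bug is caught
```

A quick reviewer heuristic that follows from this: mentally break the implementation and ask whether the test would notice.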

What we measure

We track four metrics during and after rollout:

Test coverage delta

The most obvious metric. We measure coverage before the skill is introduced and track the trend weekly. Typical results: teams see a 15-30% increase in line coverage within the first month, with the gains concentrated in code that previously had zero test coverage.

Time to write tests

We compare the time engineers spend writing tests for new features before and after skill adoption. The generated tests aren't always usable as-is, but they provide a starting structure that's faster to review and edit than writing from scratch. Typical reduction: 40-60% less time per test file.

False positive rate

What percentage of generated tests need to be significantly rewritten or discarded? We track this to tune the skill over time. A high rate means the skill instructions aren't specific enough, or the codebase has patterns the model handles poorly. Our target is a discard rate below 20%.
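One way to compute this, as a sketch: record a review outcome per generated test and divide. The outcome labels and the sample data below are made up for illustration, and only the 20% target comes from the text above.

```typescript
// Review outcome for a single generated test (labels are hypothetical).
type Outcome = "kept" | "minor-edit" | "rewritten" | "discarded";

// Share of generated tests that were significantly rewritten or discarded.
function discardRate(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  const rejected = outcomes.filter(
    (o) => o === "rewritten" || o === "discarded"
  ).length;
  return rejected / outcomes.length;
}

const TARGET = 0.2; // the 20% ceiling mentioned above

// Illustrative data only: eight generated tests, one discarded.
const sampleWeek: Outcome[] = [
  "kept", "kept", "minor-edit", "kept",
  "discarded", "minor-edit", "kept", "kept",
];

const rate = discardRate(sampleWeek);
console.log(rate); // 0.125
console.log(rate < TARGET); // true
```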

PR review cycle impact

Does better test coverage affect review times? In our experience, yes. Reviewers spend less time requesting additional tests when the coverage is already there. This is an indirect metric but a meaningful one for team velocity.

Common pitfalls

Over-mocking

AI models tend to mock aggressively. They'll mock utility functions, standard library methods, and simple data transformations that should just run normally. The skill needs explicit instructions about what to mock (external services, databases, network calls) and what not to mock (pure functions, data transformations, string formatting).
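The distinction can be sketched like this, assuming a hypothetical `UserStore` interface standing in for a database and a hand-written fake instead of a mocking library. Only the external lookup is replaced, and the pure formatter runs for real.

```typescript
// Pure function: deterministic, no I/O. Test it directly, never mock it.
function formatDisplayName(first: string, last: string): string {
  return `${last.toUpperCase()}, ${first}`;
}

// External dependency, expressed as an interface so a test can swap it out.
// (Hypothetical: stands in for a database or HTTP client.)
interface UserStore {
  findName(id: string): { first: string; last: string } | undefined;
}

// Code under test: combines the external lookup with the pure formatter.
function displayNameFor(store: UserStore, id: string): string {
  const user = store.findName(id);
  return user ? formatDisplayName(user.first, user.last) : "Unknown user";
}

// In a test, only the store is faked. formatDisplayName is not mocked,
// so a bug in the formatting would still fail the test.
const fakeStore: UserStore = {
  findName: (id) =>
    id === "u1" ? { first: "Ada", last: "Lovelace" } : undefined,
};

console.log(displayNameFor(fakeStore, "u1")); // "LOVELACE, Ada"
console.log(displayNameFor(fakeStore, "missing")); // "Unknown user"
```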

Snapshot testing as a crutch

If your codebase uses snapshot tests, the model will generate them prolifically. Snapshot tests are easy to generate but provide low-value coverage. They test that output hasn't changed, not that output is correct. We include explicit instructions to avoid snapshot tests unless the skill is specifically invoked for UI component testing.
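A stripped-down illustration of why a snapshot can lock in a bug. This is a sketch with a hypothetical `renderTotal` function, and the "snapshot" is just a stored string rather than any framework's snapshot file.

```typescript
// Hypothetical renderer with a deliberate bug: cents are divided but
// never formatted to two decimals, so 1050 renders as "$10.5".
function renderTotal(cents: number): string {
  return `Total: $${cents / 100}`; // BUG: should use toFixed(2)
}

// A snapshot is recorded from whatever the code produces on first run,
// bug included.
const recordedSnapshot = renderTotal(1050);

// Snapshot-style check: has the output changed since it was recorded?
const snapshotPasses = renderTotal(1050) === recordedSnapshot;

// Explicit assertion: is the output what the product should show?
const assertionPasses = renderTotal(1050) === "Total: $10.50";

console.log(recordedSnapshot); // "Total: $10.5"
console.log(snapshotPasses); // true, so the bug is now "expected"
console.log(assertionPasses); // false, so the bug is caught
```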

Testing implementation instead of behaviour

Generated tests sometimes test that specific internal methods are called in a specific order, rather than testing that the public API produces the correct output. This creates brittle tests that break when you refactor internals. The skill needs to emphasise behaviour-based testing.
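The brittleness is easy to see with two implementations that behave identically. In this sketch, `CartV1`, `CartV2`, and the `calls` log are hypothetical, with the log standing in for what a spy on an internal method would record.

```typescript
interface Cart {
  calls: string[]; // stand-in for what a spy on internals would record
  total(): number;
}

// Original implementation: total() delegates to an internal helper.
class CartV1 implements Cart {
  calls: string[] = [];
  private pricesCents: number[];
  constructor(pricesCents: number[]) {
    this.pricesCents = pricesCents;
  }
  private sum(): number {
    this.calls.push("sum");
    return this.pricesCents.reduce((a, b) => a + b, 0);
  }
  total(): number {
    return this.sum();
  }
}

// Refactor: same public behaviour, helper inlined.
class CartV2 implements Cart {
  calls: string[] = [];
  private pricesCents: number[];
  constructor(pricesCents: number[]) {
    this.pricesCents = pricesCents;
  }
  total(): number {
    let total = 0;
    for (const p of this.pricesCents) total += p;
    return total;
  }
}

// Implementation-coupled check: asserts that the helper was invoked.
function implementationCheck(cart: Cart): boolean {
  cart.total();
  return cart.calls.includes("sum");
}

// Behaviour check: asserts on the public result only.
function behaviourCheck(cart: Cart): boolean {
  return cart.total() === 600;
}

const prices = [100, 200, 300];
console.log(implementationCheck(new CartV1(prices))); // true
console.log(implementationCheck(new CartV2(prices))); // false: breaks on a harmless refactor
console.log(behaviourCheck(new CartV1(prices))); // true
console.log(behaviourCheck(new CartV2(prices))); // true
```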

Evolving the skill over time

The skill file is version-controlled alongside your code. As the team uses it and finds patterns that work or don't work, they update the skill. This creates a feedback loop: the AI gets better at generating tests for your specific codebase because the instructions become more specific over time.

We recommend reviewing the skill monthly during the first quarter. After that, updates tend to be driven by framework upgrades, new patterns in the codebase, or specific problem areas that surface during code review.

Results from a recent deployment

In a recent engagement with a B2B software company, we rolled out test generation skills across a 15-person engineering team. After four weeks:

Line coverage increased from 47% to 68% across the main application codebase.
Time spent writing tests for new features dropped by approximately 50%.
PR review cycles shortened because reviewers no longer needed to request additional test coverage.
The discard rate stabilised at around 15%, meaning roughly 85% of generated tests were usable with minor edits.
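When reporting a coverage change like 47% to 68%, it helps to say whether the figure is in percentage points or relative to the starting value, since the two differ considerably. A short sketch of the arithmetic (the `coverageDelta` helper is hypothetical):

```typescript
// Express a coverage change both ways so the report is unambiguous.
function coverageDelta(beforePct: number, afterPct: number) {
  return {
    // Absolute change, in percentage points.
    points: afterPct - beforePct,
    // Change relative to the starting coverage, in percent.
    relativePct: ((afterPct - beforePct) / beforePct) * 100,
  };
}

const delta = coverageDelta(47, 68);
console.log(delta.points); // 21
console.log(delta.relativePct.toFixed(1)); // "44.7"
```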

The biggest qualitative change was cultural. Engineers who previously treated testing as a chore started treating it as a default step in their workflow, because the cost of generating a first draft was near zero.

If you're considering rolling out AI-assisted test generation and want help structuring the skill, governance, and measurement, book a diagnostic. We'll review your codebase and testing patterns and help you design a skill that fits your team's conventions.