# Test Generation with Large Language Models

Ryan Dahlberg · December 5, 2025 · 13 min read

Writing tests is essential but tedious. Most developers understand testing best practices but struggle to maintain comprehensive coverage while shipping features. Large Language Models can help by generating test scaffolding, identifying edge cases, and creating test data—freeing developers to focus on business logic.

After generating thousands of tests with LLMs across multiple codebases, I’ve learned what works. Here are the patterns that deliver production-quality test suites.

## Why LLMs Excel at Test Generation

Tests follow predictable patterns, making them ideal for LLM generation:

### What Makes Tests LLM-Friendly

- Structured format - Setup, execution, assertion
- Repetitive patterns - Similar tests with varied inputs
- Clear success criteria - Tests either pass or fail
- Documented conventions - Well-established best practices
- Context from code - Function signatures reveal test structure
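
To make the first point concrete, here is that setup/execution/assertion shape in a minimal pytest test:

```python
def add(a: int, b: int) -> int:
    return a + b

def test_add_two_positive_numbers():
    # Setup: arrange the inputs
    a, b = 2, 3

    # Execution: call the code under test
    result = add(a, b)

    # Assertion: verify the observable outcome
    assert result == 5
```

This fixed three-step skeleton is exactly what we will ask the model to reproduce.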

### The Sweet Spot

LLMs are best for:

  1. Scaffolding - Creating test file structure
  2. Happy path tests - Basic functionality coverage
  3. Edge case identification - Finding boundary conditions
  4. Test data generation - Creating realistic fixtures
  5. Parameterized tests - Multiple input variations

## Basic Test Generation Pattern

Start with a simple function-to-test generator:

````python
from anthropic import Anthropic
from typing import Dict, List

class TestGenerator:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.model = "claude-3-5-sonnet-20241022"

    def generate_tests(
        self,
        source_code: str,
        test_framework: str = "pytest"
    ) -> str:
        """Generate tests for given source code"""
        prompt = self._build_prompt(source_code, test_framework)

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,  # Lower for consistency
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

    def _build_prompt(self, source_code: str, framework: str) -> str:
        """Build test generation prompt"""
        return f"""Generate comprehensive unit tests for this code.

Source Code:
```python
{source_code}
```

Requirements:
- Use {framework} framework
- Test happy paths and edge cases
- Include parameterized tests where appropriate
- Test error handling
- Use descriptive test names
- Add docstrings explaining what each test validates

Generate only the test code, no explanations."""

# Usage
generator = TestGenerator(api_key="your-key")

source = """
def calculate_discount(price: float, discount_percent: int) -> float:
    if price < 0:
        raise ValueError("Price cannot be negative")
    if discount_percent < 0 or discount_percent > 100:
        raise ValueError("Discount must be between 0 and 100")

    discount_amount = price * (discount_percent / 100)
    return price - discount_amount
"""

tests = generator.generate_tests(source)
print(tests)
````
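
For `calculate_discount`, the returned text is typically a complete pytest module. The exact output varies run to run, but it tends to look something like this (illustrative, with a hypothetical `discounts` module for the import):

```python
import pytest
from discounts import calculate_discount  # hypothetical module containing the function

def test_applies_percentage_discount():
    """A 20% discount on 100 returns 80."""
    assert calculate_discount(100.0, 20) == 80.0

def test_zero_discount_returns_original_price():
    assert calculate_discount(50.0, 0) == 50.0

def test_full_discount_returns_zero():
    assert calculate_discount(50.0, 100) == 0.0

def test_negative_price_raises():
    with pytest.raises(ValueError):
        calculate_discount(-1.0, 10)

@pytest.mark.parametrize("discount", [-1, 101])
def test_out_of_range_discount_raises(discount):
    with pytest.raises(ValueError):
        calculate_discount(100.0, discount)
```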


## Advanced Pattern: Context-Aware Generation

Feed the LLM existing tests for style consistency:

````python
class ContextAwareTestGenerator(TestGenerator):
    def generate_tests(
        self,
        source_code: str,
        existing_tests: List[str] = None,
        test_framework: str = "pytest"
    ) -> str:
        """Generate tests matching existing style"""
        prompt = self._build_contextual_prompt(
            source_code,
            existing_tests,
            test_framework
        )

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

    def _build_contextual_prompt(
        self,
        source_code: str,
        existing_tests: List[str],
        framework: str
    ) -> str:
        """Build prompt with style examples"""
        context = ""
        if existing_tests:
            examples = "\n\n".join(existing_tests[:3])  # Limit context
            context = f"""
Example tests from this codebase (match this style):
```python
{examples}
```
"""

        return f"""Generate unit tests for this code following the project's conventions.

{context}

Source Code to Test:
```python
{source_code}
```

Requirements:
- Use {framework}
- Match the style and patterns from examples
- Comprehensive coverage including edge cases
- Clear, descriptive test names
- Include fixtures if appropriate

Generate only test code."""
````


## Intelligent Test Case Discovery

Let the LLM identify what needs testing:

````python
import json
import re

class IntelligentTestGenerator(TestGenerator):
    def analyze_test_needs(self, source_code: str) -> Dict:
        """Identify test scenarios before generation"""
        prompt = f"""Analyze this code and identify test scenarios.

Source Code:
```python
{source_code}
```

Provide:
1. Happy path scenarios
2. Edge cases and boundary conditions
3. Error cases
4. Integration points
5. Performance considerations

Format as JSON:
{{
  "happy_paths": ["scenario 1", "scenario 2"],
  "edge_cases": ["edge 1", "edge 2"],
  "error_cases": ["error 1", "error 2"],
  "integration_points": ["point 1"],
  "performance_concerns": ["concern 1"]
}}"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        # Extract JSON from response
        text = response.content[0].text
        json_match = re.search(r'\{.*\}', text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return {}

    def generate_targeted_tests(
        self,
        source_code: str,
        scenarios: Dict
    ) -> str:
        """Generate tests for specific scenarios"""
        prompt = f"""Generate tests for these specific scenarios.

Source Code:
```python
{source_code}
```

Test Scenarios:
{json.dumps(scenarios, indent=2)}

Create a test for each scenario. Be thorough and explicit."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

# Usage
generator = IntelligentTestGenerator(api_key="your-key")

# First, analyze what to test
scenarios = generator.analyze_test_needs(source_code)
print("Identified scenarios:", scenarios)

# Then generate targeted tests
tests = generator.generate_targeted_tests(source_code, scenarios)
````
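
For a small function like `calculate_discount` from earlier, `analyze_test_needs` comes back with a parsed dict roughly in this shape (illustrative output; the wording differs between runs):

```python
{
    "happy_paths": [
        "standard discount applied to a positive price",
        "zero percent discount returns the original price",
    ],
    "edge_cases": [
        "discount of exactly 100",
        "price of 0.0",
        "large prices and float precision",
    ],
    "error_cases": [
        "negative price raises ValueError",
        "discount below 0 or above 100 raises ValueError",
    ],
    "integration_points": [],
    "performance_concerns": [],
}
```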


## Fixture and Test Data Generation

Generate realistic test data:

```python
import json
import re

class TestDataGenerator:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.model = "claude-3-5-sonnet-20241022"

    def generate_fixtures(
        self,
        schema: Dict,
        count: int = 5,
        realistic: bool = True
    ) -> List[Dict]:
        """Generate test fixtures matching schema"""
        prompt = f"""Generate {count} test fixtures matching this schema.

Schema:
{json.dumps(schema, indent=2)}

Requirements:
- Realistic data (names, emails, dates, etc.)
- Cover edge cases (empty strings, max lengths, special chars)
- Include valid and invalid examples
- Variety in data

Return as JSON array.
"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=3072,
            temperature=0.7,  # Higher for variety
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        text = response.content[0].text
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return []

# Usage
generator = TestDataGenerator(api_key="your-key")

user_schema = {
    "name": "string (max 100 chars)",
    "email": "string (valid email format)",
    "age": "integer (0-150)",
    "role": "enum (admin, user, guest)"
}

fixtures = generator.generate_fixtures(user_schema, count=10)

# Generate pytest fixture code
print("@pytest.fixture")
print("def user_data():")
print(f"    return {fixtures}")
```
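
Rather than dumping every record into one fixture, it often works better to parametrize over the generated data. A minimal sketch, assuming the fixtures were saved to a hypothetical `user_fixtures.json`:

```python
import json

import pytest

# Hypothetical file produced by writing `fixtures` to disk
with open("user_fixtures.json") as f:
    USER_FIXTURES = json.load(f)

@pytest.mark.parametrize("user", USER_FIXTURES, ids=lambda u: u.get("email", "no-email"))
def test_create_user_accepts_generated_fixture(user):
    # Replace this assertion with a call to your real user-creation logic
    assert "name" in user and "email" in user
```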

## Parameterized Test Generation

Generate comprehensive parameter matrices:

````python
class ParameterizedTestGenerator(TestGenerator):
    def generate_parameterized_tests(
        self,
        function_code: str,
        parameter_ranges: Dict
    ) -> str:
        """Generate parameterized tests with edge cases"""
        prompt = f"""Generate parameterized pytest tests for this function.

Function:
```python
{function_code}
```

Parameter Ranges:
{json.dumps(parameter_ranges, indent=2)}

Create pytest.mark.parametrize tests covering:
- Typical values
- Boundary values
- Invalid values (expecting errors)
- Special cases (None, empty, zero, etc.)

Use descriptive test IDs for each parameter set."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

# Usage
generator = ParameterizedTestGenerator(api_key="your-key")

function = """
def divide(a: float, b: float) -> float:
    if b == 0:
        raise ZeroDivisionError("Cannot divide by zero")
    return a / b
"""

param_ranges = {
    "a": "any float including negatives, zero, infinity",
    "b": "any float except zero, include boundaries"
}

tests = generator.generate_parameterized_tests(function, param_ranges)
````
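
For `divide`, the generated parametrized suite usually lands close to this shape (illustrative; the `calculator` module name and the chosen IDs are assumptions):

```python
import math

import pytest
from calculator import divide  # hypothetical module containing divide()

@pytest.mark.parametrize(
    "a, b, expected",
    [
        pytest.param(10.0, 2.0, 5.0, id="typical_positive"),
        pytest.param(-9.0, 3.0, -3.0, id="negative_numerator"),
        pytest.param(0.0, 5.0, 0.0, id="zero_numerator"),
        pytest.param(1.0, math.inf, 0.0, id="infinite_denominator"),
    ],
)
def test_divide_valid_inputs(a, b, expected):
    assert divide(a, b) == pytest.approx(expected)

def test_divide_by_zero_raises():
    with pytest.raises(ZeroDivisionError):
        divide(1.0, 0.0)
```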


## Integration Test Generation

Generate tests for API endpoints:

```python
import json

class APITestGenerator:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.model = "claude-3-5-sonnet-20241022"

    def generate_api_tests(
        self,
        endpoint_spec: Dict,
        framework: str = "pytest"
    ) -> str:
        """Generate API integration tests"""
        prompt = f"""Generate integration tests for this API endpoint.

Endpoint Specification:
{json.dumps(endpoint_spec, indent=2)}

Generate tests for:
1. Successful requests (200 responses)
2. Invalid input (400 responses)
3. Authentication/authorization
4. Rate limiting
5. Idempotency (if applicable)
6. Error handling

Use {framework} with requests library.
Include setup/teardown for test data.
"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

# Usage
generator = APITestGenerator(api_key="your-key")

spec = {
    "path": "/api/users",
    "method": "POST",
    "auth": "Bearer token required",
    "request_body": {
        "name": "string (required)",
        "email": "string (required, valid email)",
        "role": "string (optional, default: user)"
    },
    "responses": {
        "201": "User created successfully",
        "400": "Invalid input",
        "401": "Unauthorized",
        "409": "User already exists"
    }
}

tests = generator.generate_api_tests(spec)
```
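
For the spec above, the generated suite typically contains requests-based tests along these lines (an illustrative sketch: `BASE_URL` and the `auth_headers` fixture are assumptions about your test environment, not something the spec defines):

```python
import pytest
import requests

BASE_URL = "http://localhost:8000"  # assumption: API under test runs locally

@pytest.fixture
def auth_headers():
    # Assumption: a test token is provisioned out of band
    return {"Authorization": "Bearer test-token"}

def test_create_user_success(auth_headers):
    payload = {"name": "Ada Lovelace", "email": "ada@example.com"}
    response = requests.post(f"{BASE_URL}/api/users", json=payload, headers=auth_headers)
    assert response.status_code == 201

def test_create_user_rejects_invalid_email(auth_headers):
    payload = {"name": "Ada Lovelace", "email": "not-an-email"}
    response = requests.post(f"{BASE_URL}/api/users", json=payload, headers=auth_headers)
    assert response.status_code == 400

def test_create_user_requires_auth():
    payload = {"name": "Ada Lovelace", "email": "ada@example.com"}
    response = requests.post(f"{BASE_URL}/api/users", json=payload)
    assert response.status_code == 401
```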

## Quality Control: Validating Generated Tests

Don’t trust generated tests blindly:

```python
import subprocess
import ast
from typing import Tuple, List

class TestValidator:
    def __init__(self):
        self.required_patterns = [
            r'def test_',  # Test functions
            r'assert ',    # Assertions
        ]

    def validate_generated_tests(
        self,
        test_code: str,
        source_file: str
    ) -> Tuple[bool, List[str]]:
        """Validate generated test code"""
        issues = []

        # 1. Check syntax
        try:
            ast.parse(test_code)
        except SyntaxError as e:
            issues.append(f"Syntax error: {e}")
            return False, issues

        # 2. Check for test functions
        if 'def test_' not in test_code:
            issues.append("No test functions found")

        # 3. Check for assertions
        if 'assert' not in test_code and 'raises' not in test_code:
            issues.append("No assertions found")

        # 4. Try to run tests
        run_success, run_output = self._run_tests(test_code, source_file)
        if not run_success:
            issues.append(f"Tests failed to run: {run_output}")

        return len(issues) == 0, issues

    def _run_tests(
        self,
        test_code: str,
        source_file: str
    ) -> Tuple[bool, str]:
        """Attempt to run generated tests"""
        # Write test to temporary file
        import tempfile
        with tempfile.NamedTemporaryFile(
            mode='w',
            suffix='_test.py',
            delete=False
        ) as f:
            f.write(test_code)
            test_file = f.name

        try:
            result = subprocess.run(
                ['pytest', test_file, '-v'],
                capture_output=True,
                text=True,
                timeout=30
            )
            return result.returncode == 0, result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            return False, "Tests timed out"
        except Exception as e:
            return False, str(e)
        finally:
            import os
            os.unlink(test_file)

# Usage
validator = TestValidator()
valid, issues = validator.validate_generated_tests(generated_tests, source_file)

if not valid:
    print("Test validation failed:")
    for issue in issues:
        print(f"  - {issue}")
```
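
One caveat with the temp-file approach: generated tests only pass if they can import the code under test. A minimal variation (a sketch, assuming `source_file` is a single self-contained module) is to run pytest from a temporary directory that contains both the test and a copy of the source:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple

def run_tests_with_source(test_code: str, source_file: str) -> Tuple[bool, str]:
    """Run generated tests in a temp dir that also contains the module under test."""
    with tempfile.TemporaryDirectory() as tmp:
        # Copy the source module next to the test so `import <module>` resolves
        shutil.copy(source_file, Path(tmp) / Path(source_file).name)
        test_path = Path(tmp) / "test_generated.py"
        test_path.write_text(test_code)

        result = subprocess.run(
            ["pytest", str(test_path), "-v"],
            capture_output=True,
            text=True,
            cwd=tmp,  # run inside the temp dir so imports find the copied module
            timeout=30,
        )
        return result.returncode == 0, result.stdout + result.stderr
```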

## Iterative Refinement

Fix failing tests automatically:

````python
class TestRefiner(TestGenerator):
    def refine_failing_tests(
        self,
        test_code: str,
        source_code: str,
        error_output: str,
        max_iterations: int = 3
    ) -> str:
        """Fix failing tests iteratively"""
        current_tests = test_code

        for iteration in range(max_iterations):
            prompt = f"""These tests are failing. Fix them.

Source Code:
```python
{source_code}
```

Current Tests:
```python
{current_tests}
```

Error Output:
{error_output}

Fix the tests to:
1. Match actual function behavior
2. Use correct assertions
3. Handle exceptions properly
4. Import necessary modules

Return only the corrected test code."""

            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                temperature=0.3,
                messages=[{
                    "role": "user",
                    "content": prompt
                }]
            )

            fixed_tests = response.content[0].text

            # Validate fixes
            validator = TestValidator()
            valid, issues = validator.validate_generated_tests(
                fixed_tests,
                source_code
            )

            if valid:
                print(f"Tests fixed in {iteration + 1} iterations")
                return fixed_tests

            current_tests = fixed_tests
            error_output = "\n".join(issues)

        print(f"Could not fix tests after {max_iterations} iterations")
        return current_tests
````
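
Wiring the generator, validator, and refiner together gives a simple generate, validate, refine loop. A sketch, assuming `source_file` points at the module under test:

```python
def generate_with_refinement(source_code: str, source_file: str, api_key: str) -> str:
    """Generate tests, validate them, and ask the model to fix any failures."""
    generator = TestGenerator(api_key=api_key)
    refiner = TestRefiner(api_key=api_key)
    validator = TestValidator()

    tests = generator.generate_tests(source_code)
    valid, issues = validator.validate_generated_tests(tests, source_file)

    if not valid:
        # Feed the failures back to the model for another pass
        tests = refiner.refine_failing_tests(
            test_code=tests,
            source_code=source_code,
            error_output="\n".join(issues),
        )
    return tests
```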

## Coverage-Driven Generation

Generate tests to hit uncovered lines:

````python
import coverage

class CoverageGuidedGenerator(TestGenerator):
    def generate_for_coverage(
        self,
        source_file: str,
        existing_test_file: str,
        target_coverage: float = 0.9
    ) -> str:
        """Generate tests to improve coverage"""
        # Run coverage analysis
        cov = coverage.Coverage()
        cov.start()

        # Run the existing tests in-process so coverage can record them
        import pytest
        pytest.main(["-q", existing_test_file])

        cov.stop()
        cov.save()

        # Get executable statements and uncovered lines
        analysis = cov.analysis(source_file)
        statements = analysis[1]       # Executable line numbers
        uncovered_lines = analysis[2]  # Missing line numbers

        if not uncovered_lines:
            print("Full coverage achieved!")
            return ""

        # Read source to get context
        with open(source_file, 'r') as f:
            source_lines = f.readlines()

        # Build context around uncovered lines
        uncovered_code = self._extract_uncovered_context(
            source_lines,
            uncovered_lines
        )

        covered = len(statements) - len(uncovered_lines)
        current_coverage = covered / len(statements) if statements else 1.0

        # Generate tests for uncovered code
        prompt = f"""Generate tests to cover these uncovered code sections.

Uncovered Code:
```python
{uncovered_code}
```

Full Source File:
```python
{''.join(source_lines)}
```

Current Coverage: {current_coverage:.0%}
Target Coverage: {target_coverage:.0%}

Generate tests that will execute the uncovered lines.
Focus on the specific conditions needed to reach that code."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

    def _extract_uncovered_context(
        self,
        source_lines: List[str],
        uncovered_lines: List[int],
        context_size: int = 5
    ) -> str:
        """Extract uncovered code with surrounding context"""
        sections = []

        for line_num in uncovered_lines:
            start = max(0, line_num - context_size - 1)
            end = min(len(source_lines), line_num + context_size)

            section = f"Lines {start+1}-{end}:\n"
            for i in range(start, end):
                marker = " → " if i == line_num - 1 else "   "
                section += f"{marker}{i+1:4d}: {source_lines[i]}"

            sections.append(section)

        return "\n\n".join(sections)
````
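
Usage mirrors the other generators; the file paths below are placeholders for your own layout:

```python
# Usage (file paths are illustrative)
generator = CoverageGuidedGenerator(api_key="your-key")

new_tests = generator.generate_for_coverage(
    source_file="src/billing.py",
    existing_test_file="tests/billing_test.py",
    target_coverage=0.9,
)

if new_tests:
    print(new_tests)
```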

## Complete CI/CD Integration

Automate test generation in CI:

```python
# generate_tests.py
import os
import subprocess
import sys
from pathlib import Path

def main():
    # Get changed files from git
    changed_files = subprocess.check_output(
        ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD'],
        text=True
    ).splitlines()

    # Filter Python source files
    source_files = [
        f for f in changed_files
        if f.endswith('.py') and not f.endswith('_test.py')
    ]

    # ContextAwareTestGenerator accepts existing_tests, so new tests match project style
    generator = ContextAwareTestGenerator(
        api_key=os.environ['ANTHROPIC_API_KEY']
    )

    for source_file in source_files:
        print(f"Generating tests for {source_file}...")

        # Read source
        with open(source_file, 'r') as f:
            source_code = f.read()

        # Check if test file exists
        test_file = source_file.replace('.py', '_test.py')
        existing_tests = []

        if os.path.exists(test_file):
            with open(test_file, 'r') as f:
                existing_tests = [f.read()]

        # Generate tests
        tests = generator.generate_tests(
            source_code,
            existing_tests=existing_tests
        )

        # Validate
        validator = TestValidator()
        valid, issues = validator.validate_generated_tests(
            tests,
            source_code
        )

        if valid:
            # Append to test file
            with open(test_file, 'a') as f:
                f.write(f"\n\n# Auto-generated tests\n{tests}")
            print(f"  ✓ Added tests to {test_file}")
        else:
            print(f"  ✗ Test generation failed: {issues}")
            sys.exit(1)

if __name__ == '__main__':
    main()

```

### GitHub Action

```yaml
# .github/workflows/generate-tests.yml
name: Generate Tests

on:
  pull_request:
    paths:
      - '**.py'

jobs:
  generate-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install anthropic pytest coverage

      - name: Generate tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python generate_tests.py

      - name: Run generated tests
        run: pytest -v

      - name: Check coverage
        run: |
          coverage run -m pytest
          coverage report --fail-under=80

      - name: Commit generated tests
        if: success()
        run: |
          git config user.name "Test Bot"
          git config user.email "bot@example.com"
          git add *_test.py
          git commit -m "Add auto-generated tests" || exit 0
          git push

```

## Best Practices

### 1. Always Validate

Never commit generated tests without running them:

```python
import os
from typing import Optional

def safe_test_generation(source_code: str) -> Optional[str]:
    """Generate and validate tests"""
    generator = TestGenerator(api_key=os.environ['ANTHROPIC_API_KEY'])
    validator = TestValidator()

    tests = generator.generate_tests(source_code)
    valid, issues = validator.validate_generated_tests(tests, source_code)

    if not valid:
        print("Generated tests are invalid:")
        for issue in issues:
            print(f"  - {issue}")
        return None

    return tests

```

### 2. Human Review Required

LLMs make mistakes. Always review:

```python
from datetime import datetime

# Add marker comments
generated_tests = f"""
# WARNING: AUTO-GENERATED TESTS
# Review carefully before committing
# Generated on {datetime.now().isoformat()}

{tests}
"""

```

### 3. Start Small

Begin with simple functions:

```python
def should_generate_tests(function_code: str) -> bool:
    """Decide if function is suitable for AI test generation"""
    # Skip complex functions initially
    if 'async def' in function_code:
        return False

    # Skip functions with many dependencies
    import_count = function_code.count('import')
    if import_count > 5:
        return False

    # Good candidates: pure functions, simple logic
    return True

```

### 4. Measure Impact

Track coverage improvements:

```python
from datetime import datetime

def measure_coverage_improvement(
    before_coverage: float,
    after_coverage: float,
    test_count: int
):
    """Log test generation impact"""
    improvement = after_coverage - before_coverage

    metrics = {
        'coverage_before': before_coverage,
        'coverage_after': after_coverage,
        'improvement': improvement,
        'tests_generated': test_count,
        'timestamp': datetime.now().isoformat()
    }

    # Log to your analytics platform
    print(f"Coverage improved by {improvement:.1%} with {test_count} tests")

```

## Key Takeaways

Effective AI test generation requires:

  1. Clear prompts - Specify framework, patterns, edge cases
  2. Validation - Always run and verify generated tests
  3. Iteration - Fix failing tests automatically
  4. Context - Provide existing tests for style consistency
  5. Human oversight - Review before committing

LLMs won’t replace test engineers, but they’ll eliminate the tedious parts. Start with simple unit tests, validate everything, and gradually expand to integration tests.

Automating the tedious, so developers can focus on what matters. One test at a time.

#AI Development #Testing #Test Automation #Code Quality #TDD #DevOps