# Test Generation with Large Language Models
Writing tests is essential but tedious. Most developers understand testing best practices but struggle to maintain comprehensive coverage while shipping features. Large Language Models can help by generating test scaffolding, identifying edge cases, and creating test data—freeing developers to focus on business logic.
After generating thousands of tests with LLMs across multiple codebases, I’ve learned what works. Here are the patterns that deliver production-quality test suites.
## Why LLMs Excel at Test Generation
Tests follow predictable patterns, making them ideal for LLM generation:
### What Makes Tests LLM-Friendly
- Structured format - Setup, execution, assertion
- Repetitive patterns - Similar tests with varied inputs
- Clear success criteria - Tests either pass or fail
- Documented conventions - Well-established best practices
- Context from code - Function signatures reveal test structure
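To make the "structured format" point concrete, here is a minimal hand-written pytest test showing the setup/execution/assertion shape the model is asked to reproduce:

```python
import pytest

def test_discount_is_applied():
    """Setup, execution, assertion in one small test."""
    price, discount_percent = 100.0, 20              # setup
    result = price * (1 - discount_percent / 100)    # execution
    assert result == pytest.approx(80.0)             # assertion
```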
### The Sweet Spot
LLMs are best for:
- Scaffolding - Creating test file structure
- Happy path tests - Basic functionality coverage
- Edge case identification - Finding boundary conditions
- Test data generation - Creating realistic fixtures
- Parameterized tests - Multiple input variations
## Basic Test Generation Pattern
Start with a simple function-to-test generator:
```python
from anthropic import Anthropic
from typing import Dict, List


class TestGenerator:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.model = "claude-3-5-sonnet-20241022"

    def generate_tests(
        self,
        source_code: str,
        test_framework: str = "pytest"
    ) -> str:
        """Generate tests for given source code"""
        prompt = self._build_prompt(source_code, test_framework)

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,  # Lower for consistency
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

    def _build_prompt(self, source_code: str, framework: str) -> str:
        """Build test generation prompt"""
        return f"""Generate comprehensive unit tests for this code.

Source Code:
```python
{source_code}
```

Requirements:
- Use {framework} framework
- Test happy paths and edge cases
- Include parameterized tests where appropriate
- Test error handling
- Use descriptive test names
- Add docstrings explaining what each test validates

Generate only the test code, no explanations."""


# Usage
generator = TestGenerator(api_key="your-key")

source = """
def calculate_discount(price: float, discount_percent: int) -> float:
    if price < 0:
        raise ValueError("Price cannot be negative")
    if discount_percent < 0 or discount_percent > 100:
        raise ValueError("Discount must be between 0 and 100")

    discount_amount = price * (discount_percent / 100)
    return price - discount_amount
"""

tests = generator.generate_tests(source)
print(tests)
```
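One practical note: depending on the prompt, the response sometimes arrives wrapped in a markdown fence. A small helper (my own convention, not part of the Anthropic API) strips that before the tests are written to a file:

```python
import re

def extract_code(response_text: str) -> str:
    """Strip an optional ```python ... ``` fence from a model response."""
    match = re.search(r"```(?:python)?\s*(.*?)```", response_text, re.DOTALL)
    return match.group(1).strip() if match else response_text.strip()

# tests = extract_code(generator.generate_tests(source))
```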
## Advanced Pattern: Context-Aware Generation
Feed the LLM existing tests for style consistency:
```python
class ContextAwareTestGenerator(TestGenerator):
    def generate_tests(
        self,
        source_code: str,
        existing_tests: List[str] = None,
        test_framework: str = "pytest"
    ) -> str:
        """Generate tests matching existing style"""
        prompt = self._build_contextual_prompt(
            source_code,
            existing_tests,
            test_framework
        )

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

    def _build_contextual_prompt(
        self,
        source_code: str,
        existing_tests: List[str],
        framework: str
    ) -> str:
        """Build prompt with style examples"""
        context = ""
        if existing_tests:
            examples = "\n\n".join(existing_tests[:3])  # Limit context
            context = f"""
Example tests from this codebase (match this style):
```python
{examples}
```
"""

        return f"""Generate unit tests for this code following the project's conventions.
{context}
Source Code to Test:
{source_code}

Requirements:
- Use {framework}
- Match the style and patterns from examples
- Comprehensive coverage including edge cases
- Clear, descriptive test names
- Include fixtures if appropriate

Generate only test code."""
```
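A typical call site, assuming a handful of existing test files can serve as style examples (the paths here are illustrative):

```python
from pathlib import Path

# Pull a few existing tests from the repo as style examples
existing = [p.read_text() for p in sorted(Path("tests").glob("test_*.py"))[:3]]

generator = ContextAwareTestGenerator(api_key="your-key")
tests = generator.generate_tests(source_code, existing_tests=existing)
```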
## Intelligent Test Case Discovery
Let the LLM identify what needs testing:
```python
import json
import re


class IntelligentTestGenerator(TestGenerator):
    def analyze_test_needs(self, source_code: str) -> Dict:
        """Identify test scenarios before generation"""
        prompt = f"""Analyze this code and identify test scenarios.

Source Code:
```python
{source_code}
```

Provide:
- Happy path scenarios
- Edge cases and boundary conditions
- Error cases
- Integration points
- Performance considerations

Format as JSON:
{{
    "happy_paths": ["scenario 1", "scenario 2"],
    "edge_cases": ["edge 1", "edge 2"],
    "error_cases": ["error 1", "error 2"],
    "integration_points": ["point 1"],
    "performance_concerns": ["concern 1"]
}}"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        # Extract JSON from response
        text = response.content[0].text
        json_match = re.search(r'\{.*\}', text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return {}

    def generate_targeted_tests(
        self,
        source_code: str,
        scenarios: Dict
    ) -> str:
        """Generate tests for specific scenarios"""
        prompt = f"""Generate tests for these specific scenarios.

Source Code:
{source_code}

Test Scenarios:
{json.dumps(scenarios, indent=2)}

Create a test for each scenario. Be thorough and explicit."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text


# Usage
generator = IntelligentTestGenerator(api_key="your-key")

# First, analyze what to test
scenarios = generator.analyze_test_needs(source_code)
print("Identified scenarios:", scenarios)

# Then generate targeted tests
tests = generator.generate_targeted_tests(source_code, scenarios)
```
## Fixture and Test Data Generation
Generate realistic test data:
```python
class TestDataGenerator:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.model = "claude-3-5-sonnet-20241022"

    def generate_fixtures(
        self,
        schema: Dict,
        count: int = 5,
        realistic: bool = True
    ) -> List[Dict]:
        """Generate test fixtures matching schema"""
        prompt = f"""Generate {count} test fixtures matching this schema.

Schema:
{json.dumps(schema, indent=2)}

Requirements:
- Realistic data (names, emails, dates, etc.)
- Cover edge cases (empty strings, max lengths, special chars)
- Include valid and invalid examples
- Variety in data

Return as JSON array."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=3072,
            temperature=0.7,  # Higher for variety
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        text = response.content[0].text
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return []


# Usage
generator = TestDataGenerator(api_key="your-key")

user_schema = {
    "name": "string (max 100 chars)",
    "email": "string (valid email format)",
    "age": "integer (0-150)",
    "role": "enum (admin, user, guest)"
}

fixtures = generator.generate_fixtures(user_schema, count=10)

# Generate pytest fixture code
print("@pytest.fixture")
print("def user_data():")
print(f"    return {fixtures}")
```
## Parameterized Test Generation
Generate comprehensive parameter matrices:
```python
class ParameterizedTestGenerator(TestGenerator):
    def generate_parameterized_tests(
        self,
        function_code: str,
        parameter_ranges: Dict
    ) -> str:
        """Generate parameterized tests with edge cases"""
        prompt = f"""Generate parameterized pytest tests for this function.

Function:
```python
{function_code}
```

Parameter Ranges:
{json.dumps(parameter_ranges, indent=2)}

Create pytest.mark.parametrize tests covering:
- Typical values
- Boundary values
- Invalid values (expecting errors)
- Special cases (None, empty, zero, etc.)

Use descriptive test IDs for each parameter set."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text


# Usage
generator = ParameterizedTestGenerator(api_key="your-key")

function = """
def divide(a: float, b: float) -> float:
    if b == 0:
        raise ZeroDivisionError("Cannot divide by zero")
    return a / b
"""

param_ranges = {
    "a": "any float including negatives, zero, infinity",
    "b": "any float except zero, include boundaries"
}

tests = generator.generate_parameterized_tests(function, param_ranges)
```
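For reference, the shape I expect back for `divide` looks roughly like this (hand-written here, not actual model output):

```python
import math
import pytest

# Assumes divide() is importable from the module under test

@pytest.mark.parametrize(
    "a, b, expected",
    [
        pytest.param(10.0, 2.0, 5.0, id="typical_values"),
        pytest.param(-9.0, 3.0, -3.0, id="negative_numerator"),
        pytest.param(0.0, 5.0, 0.0, id="zero_numerator"),
        pytest.param(1.0, math.inf, 0.0, id="infinite_denominator"),
    ],
)
def test_divide(a, b, expected):
    assert divide(a, b) == pytest.approx(expected)

def test_divide_by_zero_raises():
    with pytest.raises(ZeroDivisionError):
        divide(1.0, 0.0)
```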
## Integration Test Generation
Generate tests for API endpoints:
```python
class APITestGenerator:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        self.model = "claude-3-5-sonnet-20241022"

    def generate_api_tests(
        self,
        endpoint_spec: Dict,
        framework: str = "pytest"
    ) -> str:
        """Generate API integration tests"""
        prompt = f"""Generate integration tests for this API endpoint.

Endpoint Specification:
{json.dumps(endpoint_spec, indent=2)}

Generate tests for:
1. Successful requests (2xx responses)
2. Invalid input (400 responses)
3. Authentication/authorization
4. Rate limiting
5. Idempotency (if applicable)
6. Error handling

Use {framework} with the requests library.
Include setup/teardown for test data."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text


# Usage
generator = APITestGenerator(api_key="your-key")

spec = {
    "path": "/api/users",
    "method": "POST",
    "auth": "Bearer token required",
    "request_body": {
        "name": "string (required)",
        "email": "string (required, valid email)",
        "role": "string (optional, default: user)"
    },
    "responses": {
        "201": "User created successfully",
        "400": "Invalid input",
        "401": "Unauthorized",
        "409": "User already exists"
    }
}

tests = generator.generate_api_tests(spec)
```
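The generated suite should end up looking something like this sketch (the base URL, token variable, and payloads are placeholders, not part of any real service):

```python
import os

import pytest
import requests

BASE_URL = os.environ.get("API_BASE_URL", "http://localhost:8000")  # placeholder

@pytest.fixture
def auth_headers():
    # API_TEST_TOKEN is an assumed environment variable for the test account
    return {"Authorization": f"Bearer {os.environ['API_TEST_TOKEN']}"}

def test_create_user_returns_201(auth_headers):
    response = requests.post(
        f"{BASE_URL}/api/users",
        json={"name": "Test User", "email": "test.user@example.com"},
        headers=auth_headers,
        timeout=10,
    )
    assert response.status_code == 201

def test_create_user_without_token_returns_401():
    response = requests.post(
        f"{BASE_URL}/api/users",
        json={"name": "Test User", "email": "test.user@example.com"},
        timeout=10,
    )
    assert response.status_code == 401
```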
## Quality Control: Validating Generated Tests
Don’t trust generated tests blindly:
```python
import subprocess
import ast
from typing import Tuple, List


class TestValidator:
    def __init__(self):
        self.required_patterns = [
            r'def test_',  # Test functions
            r'assert ',    # Assertions
        ]

    def validate_generated_tests(
        self,
        test_code: str,
        source_file: str
    ) -> Tuple[bool, List[str]]:
        """Validate generated test code"""
        issues = []

        # 1. Check syntax
        try:
            ast.parse(test_code)
        except SyntaxError as e:
            issues.append(f"Syntax error: {e}")
            return False, issues

        # 2. Check for test functions
        if 'def test_' not in test_code:
            issues.append("No test functions found")

        # 3. Check for assertions
        if 'assert' not in test_code and 'raises' not in test_code:
            issues.append("No assertions found")

        # 4. Try to run tests
        run_success, run_output = self._run_tests(test_code, source_file)
        if not run_success:
            issues.append(f"Tests failed to run: {run_output}")

        return len(issues) == 0, issues

    def _run_tests(
        self,
        test_code: str,
        source_file: str
    ) -> Tuple[bool, str]:
        """Attempt to run generated tests"""
        # Write test to temporary file
        import tempfile
        with tempfile.NamedTemporaryFile(
            mode='w',
            suffix='_test.py',
            delete=False
        ) as f:
            f.write(test_code)
            test_file = f.name

        try:
            result = subprocess.run(
                ['pytest', test_file, '-v'],
                capture_output=True,
                text=True,
                timeout=30
            )
            return result.returncode == 0, result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            return False, "Tests timed out"
        except Exception as e:
            return False, str(e)
        finally:
            import os
            os.unlink(test_file)


# Usage
validator = TestValidator()
valid, issues = validator.validate_generated_tests(generated_tests, source_file)

if not valid:
    print("Test validation failed:")
    for issue in issues:
        print(f"  - {issue}")
```
## Iterative Refinement
Fix failing tests automatically:
```python
class TestRefiner(TestGenerator):
    def refine_failing_tests(
        self,
        test_code: str,
        source_code: str,
        error_output: str,
        max_iterations: int = 3
    ) -> str:
        """Fix failing tests iteratively"""
        current_tests = test_code

        for iteration in range(max_iterations):
            prompt = f"""These tests are failing. Fix them.

Source Code:
```python
{source_code}
```

Current Tests:
{current_tests}

Error Output:
{error_output}

Fix the tests to:
- Match actual function behavior
- Use correct assertions
- Handle exceptions properly
- Import necessary modules

Return only the corrected test code."""

            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                temperature=0.3,
                messages=[{
                    "role": "user",
                    "content": prompt
                }]
            )

            fixed_tests = response.content[0].text

            # Validate fixes
            validator = TestValidator()
            valid, issues = validator.validate_generated_tests(
                fixed_tests,
                source_code
            )

            if valid:
                print(f"Tests fixed in {iteration + 1} iterations")
                return fixed_tests

            current_tests = fixed_tests
            error_output = "\n".join(issues)

        print(f"Could not fix tests after {max_iterations} iterations")
        return current_tests
```
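Wired together with the generator and validator, the loop looks something like this (variable names are illustrative):

```python
generator = TestGenerator(api_key="your-key")
refiner = TestRefiner(api_key="your-key")
validator = TestValidator()

tests = generator.generate_tests(source_code)
valid, issues = validator.validate_generated_tests(tests, source_code)

if not valid:
    tests = refiner.refine_failing_tests(tests, source_code, "\n".join(issues))
```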
## Coverage-Driven Generation
Generate tests to hit uncovered lines:
```python
import coverage
import json


class CoverageGuidedGenerator(TestGenerator):
    def generate_for_coverage(
        self,
        source_file: str,
        existing_test_file: str,
        target_coverage: float = 0.9
    ) -> str:
        """Generate tests to improve coverage"""
        # Run the existing tests under coverage (in-process via pytest)
        cov = coverage.Coverage()
        cov.start()

        import pytest
        pytest.main(["-q", existing_test_file])

        cov.stop()
        cov.save()

        # analysis() returns (filename, statements, missing, missing_formatted)
        analysis = cov.analysis(source_file)
        statements = analysis[1]
        uncovered_lines = analysis[2]  # Missing lines

        if not uncovered_lines:
            print("Full coverage achieved!")
            return ""

        # Read source to get context
        with open(source_file, 'r') as f:
            source_lines = f.readlines()

        # Build context around uncovered lines
        uncovered_code = self._extract_uncovered_context(
            source_lines,
            uncovered_lines
        )

        current_coverage = 1 - len(uncovered_lines) / max(len(statements), 1)

        # Generate tests for uncovered code
        prompt = f"""Generate tests to cover these uncovered code sections.

Uncovered Code:
```python
{uncovered_code}
```

Full Source File:
{''.join(source_lines)}

Current Coverage: {current_coverage:.0%}
Target Coverage: {target_coverage:.0%}

Generate tests that will execute the uncovered lines.
Focus on the specific conditions needed to reach that code."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            temperature=0.3,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        return response.content[0].text

    def _extract_uncovered_context(
        self,
        source_lines: List[str],
        uncovered_lines: List[int],
        context_size: int = 5
    ) -> str:
        """Extract uncovered code with surrounding context"""
        sections = []

        for line_num in uncovered_lines:
            start = max(0, line_num - context_size - 1)
            end = min(len(source_lines), line_num + context_size)

            section = f"Lines {start+1}-{end}:\n"
            for i in range(start, end):
                marker = " → " if i == line_num - 1 else "   "
                section += f"{marker}{i+1:4d}: {source_lines[i]}"

            sections.append(section)

        return "\n\n".join(sections)
```
## Complete CI/CD Integration
Automate test generation in CI:
```python
# generate_tests.py
# Assumes the generator and validator classes above are importable.
import os
import subprocess
import sys


def main():
    # Get changed files from git
    changed_files = subprocess.check_output(
        ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD'],
        text=True
    ).splitlines()

    # Filter Python source files
    source_files = [
        f for f in changed_files
        if f.endswith('.py') and not f.endswith('_test.py')
    ]

    # Context-aware generation so new tests match existing style
    generator = ContextAwareTestGenerator(
        api_key=os.environ['ANTHROPIC_API_KEY']
    )

    for source_file in source_files:
        print(f"Generating tests for {source_file}...")

        # Read source
        with open(source_file, 'r') as f:
            source_code = f.read()

        # Check if test file exists
        test_file = source_file.replace('.py', '_test.py')
        existing_tests = []
        if os.path.exists(test_file):
            with open(test_file, 'r') as f:
                existing_tests = [f.read()]

        # Generate tests
        tests = generator.generate_tests(
            source_code,
            existing_tests=existing_tests
        )

        # Validate
        validator = TestValidator()
        valid, issues = validator.validate_generated_tests(
            tests,
            source_code
        )

        if valid:
            # Append to test file
            with open(test_file, 'a') as f:
                f.write(f"\n\n# Auto-generated tests\n{tests}")
            print(f"  ✓ Added tests to {test_file}")
        else:
            print(f"  ✗ Test generation failed: {issues}")
            sys.exit(1)


if __name__ == '__main__':
    main()
```
### GitHub Action
```yaml
# .github/workflows/generate-tests.yml
name: Generate Tests

on:
  pull_request:
    paths:
      - '**.py'

jobs:
  generate-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install anthropic pytest coverage

      - name: Generate tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python generate_tests.py

      - name: Run generated tests
        run: pytest -v

      - name: Check coverage
        run: |
          coverage run -m pytest
          coverage report --fail-under=80

      - name: Commit generated tests
        if: success()
        run: |
          git config user.name "Test Bot"
          git config user.email "bot@example.com"
          git add *_test.py
          git commit -m "Add auto-generated tests" || exit 0
          git push
```
## Best Practices
### 1. Always Validate
Never commit generated tests without running them:
```python
import os
from typing import Optional


def safe_test_generation(source_code: str) -> Optional[str]:
    """Generate and validate tests"""
    generator = TestGenerator(api_key=os.environ['ANTHROPIC_API_KEY'])
    validator = TestValidator()

    tests = generator.generate_tests(source_code)
    valid, issues = validator.validate_generated_tests(tests, source_code)

    if not valid:
        print("Generated tests are invalid:")
        for issue in issues:
            print(f"  - {issue}")
        return None

    return tests
```
### 2. Human Review Required
LLMs make mistakes. Always review:
```python
from datetime import datetime

# Add marker comments
generated_tests = f"""
# WARNING: AUTO-GENERATED TESTS
# Review carefully before committing
# Generated on {datetime.now().isoformat()}

{tests}
"""
```
### 3. Start Small
Begin with simple functions:
```python
def should_generate_tests(function_code: str) -> bool:
    """Decide if function is suitable for AI test generation"""
    # Skip complex functions initially
    if 'async def' in function_code:
        return False

    # Skip functions with many dependencies
    import_count = function_code.count('import')
    if import_count > 5:
        return False

    # Good candidates: pure functions, simple logic
    return True
```
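Applied to a whole module, that filter might be used like this (a sketch; `candidate_functions` is my own helper, not from any library):

```python
import ast

def candidate_functions(source: str) -> list:
    """Return the source of functions worth targeting first."""
    tree = ast.parse(source)
    candidates = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            snippet = ast.get_source_segment(source, node)
            if snippet and should_generate_tests(snippet):
                candidates.append(snippet)
    return candidates
```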
### 4. Measure Impact
Track coverage improvements:
```python
def measure_coverage_improvement(
    before_coverage: float,
    after_coverage: float,
    test_count: int
):
    """Log test generation impact"""
    improvement = after_coverage - before_coverage

    metrics = {
        'coverage_before': before_coverage,
        'coverage_after': after_coverage,
        'improvement': improvement,
        'tests_generated': test_count,
        'timestamp': datetime.now().isoformat()
    }

    # Log to your analytics platform
    print(f"Coverage improved by {improvement:.1%} with {test_count} tests")
```
## Key Takeaways
Effective AI test generation requires:
- Clear prompts - Specify framework, patterns, edge cases
- Validation - Always run and verify generated tests
- Iteration - Fix failing tests automatically
- Context - Provide existing tests for style consistency
- Human oversight - Review before committing
LLMs won’t replace test engineers, but they’ll eliminate the tedious parts. Start with simple unit tests, validate everything, and gradually expand to integration tests.
Automating the tedious, so developers can focus on what matters. One test at a time.