Integration Testing Guide
This guide covers the comprehensive integration test suite for the ADK Agents system, designed following Google ADK integration testing patterns and best practices.
Overview
The integration test suite provides end-to-end validation of the multi-agent system using two complementary approaches:
Traditional Integration Testing
- Agent Lifecycle Management - Complete conversation turns with context management
- Workflow Orchestration - Sequential, parallel, iterative, and human-in-loop patterns
- Context Management - Smart prioritization, cross-turn correlation, and RAG integration
- Tool Orchestration - Advanced tool coordination with error handling
- Performance Verification - Load testing, optimization validation, and stress testing
ADK Evaluation Framework ⭐ NEW
- Behavioral Testing - Real agent behavior validation using evaluation scenarios
- Tool Usage Patterns - Expected tool usage and parameter validation
- Agent Communication - Multi-agent coordination and response quality
- Memory & Persistence - Session continuity and knowledge retention
- Real-World Scenarios - User-like interactions with expected outcomes
Test Suite Architecture
Hybrid Testing Approach
The integration tests combine traditional pytest patterns with modern ADK evaluation scenarios:
Traditional Testing (Phases 1-4)
Phase 1: Foundation Tests
- Agent Lifecycle Tests - Basic conversation turn execution
- Workflow Orchestration - Core workflow pattern validation
- Context Flow - Multi-turn context management
- Token Management - Budget management and optimization
Phase 2: Core Integration Tests
- Smart Prioritization - Content relevance scoring
- Cross-turn Correlation - Conversation relationship detection
- Intelligent Summarization - Context-aware content reduction
- Dynamic Context Expansion - Automatic content discovery
- RAG Integration - Semantic search and indexing
Phase 3: Tool Orchestration Tests
- Sequential Tool Execution - Dependency management
- Parallel Tool Execution - Performance optimization
- Error Handling - Recovery mechanisms
- State Management - Tool coordination
- Complex Workflows - Multi-phase execution
Phase 4: Performance Verification
- Load Testing - Concurrent user simulation
- Performance Comparison - Parallel vs sequential execution
- Memory Optimization - Leak detection and management
- Token Optimization - Counting performance
- Stress Testing - Extreme scenario handling
ADK Evaluation Framework ⭐ NEW
Phase 5: Behavioral Evaluation Tests
- Simple Code Analysis - Basic agent functionality validation
- Sub-Agent Delegation - Hierarchical agent communication
- Tool Usage Patterns - Expected tool execution scenarios
- Multi-Agent Coordination - Workflow orchestration and result aggregation
- Agent Memory & Persistence - Session continuity and knowledge retention
Evaluation Test Format
Evaluation tests use JSON files (`.evalset.json`) that define:
- User Queries - Natural language requests
- Expected Tool Usage - Tools and parameters the agent should use
- Agent Responses - Expected communication patterns
- Outcome References - Desired results and behavior
Package Management
Important: This project uses `uv` exclusively for all Python package management tasks. Never use `pip` directly.
Installation and Setup
```bash
# Install dependencies
uv sync --dev

# Install additional test dependencies
uv add --dev pytest-xdist pytest-benchmark

# Install project in development mode
uv pip install -e .

# Check installed packages
uv pip list
```
Running Tests with uv
All test commands should be prefixed with `uv run`:
```bash
# Basic test execution
uv run pytest

# With specific options
uv run pytest --cov=src --cov-report=html

# Run specific test files
uv run pytest tests/integration/test_agent_lifecycle.py

# Run with markers
uv run pytest -m "integration and foundation"
```
Why uv?
- Consistency: Ensures all team members use the same package versions
- Performance: Faster dependency resolution and installation
- Reliability: Better handling of dependency conflicts
- Modern: State-of-the-art Python package management
ADK Evaluation Framework ⭐ NEW
Overview
The ADK Evaluation Framework provides behavioral testing that validates how agents actually behave in real-world scenarios, complementing traditional structural testing. This approach follows the official Google ADK evaluation patterns.
Key Benefits
- Real Behavior Testing - Tests actual agent responses, not just structure
- User-Centric Scenarios - Tests mirror actual user interactions
- Tool Usage Validation - Ensures agents use tools correctly
- Response Quality - Validates communication patterns and outcomes
- Future-Proof - Ready for official ADK evaluation module integration
Evaluation Test Structure
Evaluation tests are stored in `tests/integration/evaluation_tests/` as JSON files:
```text
tests/integration/evaluation_tests/
├── simple_code_analysis.evalset.json
├── sub_agent_delegation.evalset.json
├── tool_usage.evalset.json
├── multi_agent_coordination.evalset.json
├── agent_memory_persistence.evalset.json
└── test_config.json
```
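For orientation, here is one way a test could discover and parse these files. This loader is a hypothetical sketch, not a helper that ships with the suite:

```python
import json
from pathlib import Path

# Directory layout shown above.
EVAL_DIR = Path("tests/integration/evaluation_tests")

def load_evalsets(eval_dir: Path = EVAL_DIR) -> dict:
    """Map each evalset file's stem to its parsed JSON content."""
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(eval_dir.glob("*.evalset.json"))
    }
```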
Creating Evaluation Tests
Basic Evaluation Scenario Format
```json
{
  "test_name": "Agent Behavior Evaluation",
  "description": "Tests agent responses to real-world scenarios",
  "version": "1.0.0",
  "test_scenarios": [
    {
      "scenario_id": "basic_code_analysis",
      "description": "Test basic code analysis capabilities",
      "query": "Analyze this Python code for potential issues: def calculate(x, y): return x/y",
      "expected_tool_use": [
        {
          "tool_name": "code_analyzer",
          "inputs": {
            "code": "def calculate(x, y): return x/y",
            "language": "python"
          }
        }
      ],
      "expected_intermediate_agent_responses": [
        {
          "agent_type": "code_quality_agent",
          "response_pattern": "division by zero vulnerability",
          "coordination_actions": ["risk_assessment", "recommendation_generation"]
        }
      ],
      "reference": "Agent should identify division by zero risk and suggest input validation"
    }
  ],
  "evaluation_criteria": {
    "tool_usage_accuracy": "Agent should use appropriate tools with correct parameters",
    "response_quality": "Agent should provide actionable insights and recommendations"
  }
}
```
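As a rough illustration, the structural checks that `test_adk_evaluation_patterns.py` performs on this format could look like the following; the exact keys it enforces are assumptions based on the example above:

```python
import json
from pathlib import Path

import pytest

EVAL_FILES = sorted(Path("tests/integration/evaluation_tests").glob("*.evalset.json"))

@pytest.mark.parametrize("path", EVAL_FILES, ids=lambda p: p.name)
def test_evalset_schema(path):
    data = json.loads(path.read_text())
    # Top-level fields from the format above.
    for key in ("test_name", "description", "test_scenarios"):
        assert key in data, f"{path.name} is missing {key!r}"
    for scenario in data["test_scenarios"]:
        assert scenario["scenario_id"] and scenario["query"]
        # Tool expectations, when present, must name the tool.
        for tool in scenario.get("expected_tool_use", []):
            assert "tool_name" in tool
```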
Multi-Agent Coordination Example
```json
{
  "scenario_id": "workflow_orchestration",
  "description": "Test coordination between multiple agents",
  "query": "I need to implement a new user authentication feature. Please coordinate between design, development, and testing teams.",
  "expected_tool_use": [
    {
      "tool_name": "workflow_orchestrator",
      "inputs": {
        "workflow_type": "feature_development",
        "agents_required": ["design_pattern_agent", "code_review_agent", "testing_agent"],
        "coordination_strategy": "sequential_with_feedback"
      }
    }
  ],
  "expected_intermediate_agent_responses": [
    {
      "agent_type": "design_pattern_agent",
      "response_pattern": "authentication architecture design",
      "coordination_actions": ["state_update", "next_agent_notification"]
    },
    {
      "agent_type": "code_review_agent",
      "response_pattern": "implementation review feedback",
      "coordination_actions": ["quality_validation", "testing_handoff"]
    }
  ],
  "reference": "Workflow should demonstrate proper agent handoffs and collaborative task completion"
}
```
Memory & Persistence Example
```json
{
  "scenario_id": "session_continuity",
  "description": "Test session continuity across interactions",
  "query": "Remember that I'm working on a Flask web application. We discussed implementing user authentication. Now I need to add password reset functionality.",
  "expected_tool_use": [
    {
      "tool_name": "session_memory_manager",
      "inputs": {
        "operation": "retrieve_session_context",
        "context_keys": ["project_type", "framework", "previous_features"]
      }
    },
    {
      "tool_name": "persistent_memory_tool",
      "inputs": {
        "operation": "load_memory",
        "memory_type": "project_context",
        "filters": ["flask_authentication", "web_development"]
      }
    }
  ],
  "expected_intermediate_agent_responses": [
    {
      "agent_type": "memory_retrieval_agent",
      "response_pattern": "retrieved context about Flask project and authentication work",
      "coordination_actions": ["context_validation", "continuity_establishment"]
    }
  ],
  "reference": "Agent should demonstrate clear continuity from previous conversation, referencing Flask project and authentication implementation"
}
```
Running Evaluation Tests
Basic Evaluation Test Execution
```bash
# Run all evaluation tests
uv run pytest tests/integration/test_adk_evaluation_patterns.py -v

# Run specific evaluation test
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_multi_agent_coordination_evaluation -v

# Run evaluation tests with integration suite
./tests/integration/run_integration_tests.py --suite "ADK Evaluation"
```
Test Configuration
Configure evaluation criteria in `test_config.json`:
```json
{
  "criteria": {
    "tool_trajectory_avg_score": 0.8,
    "response_match_score": 0.7
  },
  "evaluation_settings": {
    "timeout_seconds": 30,
    "max_retries": 3,
    "parallel_execution": true
  }
}
```
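A harness could apply these thresholds along the following lines; the field names mirror the file above, while the measured scores are placeholders produced elsewhere by an evaluation run:

```python
import json
from pathlib import Path

CONFIG = Path("tests/integration/evaluation_tests/test_config.json")

def meets_criteria(scores: dict) -> bool:
    """True when every measured score reaches its configured threshold."""
    criteria = json.loads(CONFIG.read_text())["criteria"]
    return all(
        scores.get(name, 0.0) >= threshold
        for name, threshold in criteria.items()
    )

# e.g. meets_criteria({"tool_trajectory_avg_score": 0.85, "response_match_score": 0.72})
```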
Best Practices for Evaluation Tests
Creating Effective Scenarios
- Use Natural Language Queries - Write queries as users would ask them
- Be Specific About Expected Tools - Define exact tool names and parameters
- Include Realistic Coordination - Show how agents should work together
- Test Edge Cases - Include error scenarios and boundary conditions
- Focus on Behavior - Test what the agent does, not just what it returns
Evaluation Test Patterns
Simple Analysis Pattern
```json
{
  "query": "Check this code for bugs: [code sample]",
  "expected_tool_use": [{"tool_name": "code_analyzer", "inputs": {"code": "..."}}],
  "reference": "Should identify specific issues and suggest fixes"
}
```
Coordination Pattern
```json
{
  "query": "Coordinate a code review with multiple team members",
  "expected_tool_use": [{"tool_name": "workflow_orchestrator", "inputs": {"agents": [...]}}],
  "reference": "Should demonstrate proper multi-agent coordination"
}
```
Memory Pattern
```json
{
  "query": "Continue working on the project we discussed earlier",
  "expected_tool_use": [{"tool_name": "session_memory_manager", "inputs": {"operation": "retrieve_context"}}],
  "reference": "Should demonstrate session continuity and context awareness"
}
```
Integration with Traditional Tests
The evaluation framework complements traditional tests:
- Traditional Tests - Validate structure, mocking, and component integration
- Evaluation Tests - Validate behavior, tool usage, and real-world scenarios
- Combined Coverage - Complete validation of both implementation and behavior
Quick Start
Running All Tests
```bash
# Run the complete integration test suite (traditional + evaluation)
./tests/integration/run_integration_tests.py

# Run with detailed output
./tests/integration/run_integration_tests.py --verbose

# Run in parallel mode (faster)
./tests/integration/run_integration_tests.py --parallel

# Test conftest.py fixtures
uv run pytest tests/integration/test_conftest_example.py -v

# Run integration tests with pytest directly
uv run pytest tests/integration/ -m "integration and foundation" -v
```
Running Evaluation Tests ⭐ NEW
```bash
# Run all ADK evaluation tests
uv run pytest tests/integration/test_adk_evaluation_patterns.py -v

# Run specific evaluation test categories
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_multi_agent_coordination_evaluation -v
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_agent_memory_persistence_evaluation -v

# Run evaluation tests with integration suite
./tests/integration/run_integration_tests.py --suite "ADK Evaluation"

# Validate evaluation test files
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_evaluation_test_files_exist -v
```
Running Specific Test Suites
```bash
# Run only foundation tests
./tests/integration/run_integration_tests.py --suite "Foundation"

# Run only performance tests
./tests/integration/run_integration_tests.py --suite "Performance"

# Run only tool orchestration tests
./tests/integration/run_integration_tests.py --suite "Tool Orchestration"
```
Including Stress Tests
```bash
# Run all tests including stress tests
./tests/integration/run_integration_tests.py --stress
```
Test Files Overview
Traditional Integration Test Files
| File | Purpose | Test Count |
|---|---|---|
| `test_agent_lifecycle.py` | Agent lifecycle and workflow orchestration | 12+ |
| `test_context_management_advanced.py` | Advanced context management with RAG | 15+ |
| `test_tool_orchestration_advanced.py` | Tool orchestration with error handling | 18+ |
| `test_performance_verification.py` | Performance and load testing | 12+ |
| `test_conftest_example.py` | Fixture usage examples and validation | 26+ |
| `run_integration_tests.py` | Comprehensive test runner | N/A |
ADK Evaluation Test Files ⭐ NEW
| File | Purpose | Scenarios |
|---|---|---|
| `test_adk_evaluation_patterns.py` | Evaluation framework validation and execution | 11+ |
| `evaluation_tests/simple_code_analysis.evalset.json` | Basic agent functionality scenarios | 3+ |
| `evaluation_tests/sub_agent_delegation.evalset.json` | Agent hierarchy and delegation patterns | 3+ |
| `evaluation_tests/tool_usage.evalset.json` | Tool usage and parameter validation | 3+ |
| `evaluation_tests/multi_agent_coordination.evalset.json` | Multi-agent coordination scenarios | 6+ |
| `evaluation_tests/agent_memory_persistence.evalset.json` | Memory and session continuity scenarios | 7+ |
| `evaluation_tests/test_config.json` | Evaluation criteria and configuration | N/A |
Test Utilities
| File | Purpose |
|---|---|
| `tests/fixtures/test_helpers.py` | Mock utilities and test fixtures |
| `tests/conftest.py` | Main pytest configuration and shared fixtures |
| `tests/integration/conftest.py` | Integration-specific fixtures and configuration |
Understanding Test Results
Test Output Format
```text
🧪 INTEGRATION TEST SUITE SUMMARY
================================================================================
Total Duration: 45.2s
Total Tests: 57
Passed: 57 ✅
Failed: 0 ❌
Success Rate: 100.0%
Test Suites: 4

📋 TEST SUITE BREAKDOWN:
  ✅ Foundation Tests: 8/8 passed (100.0%)
  ✅ Core Integration Tests: 15/15 passed (100.0%)
  ✅ Tool Orchestration Tests: 18/18 passed (100.0%)
  ✅ Performance Verification Tests: 16/16 passed (100.0%)

⚡ PERFORMANCE METRICS:
  Fastest Test: test_token_counting_performance (0.012s)
  Slowest Test: test_load_testing_simulation (8.450s)
  Average Test Duration: 0.793s

💡 RECOMMENDATIONS:
  🎉 Perfect test suite! All tests passing - excellent work!
```
Report Files
Test results are automatically saved to `test_reports/` with detailed JSON reports including:
- Test execution details with timings and results
- Performance metrics with memory and CPU usage
- Error analysis with detailed failure information
- Recommendations for optimization and improvements
Key Features
Integration-Specific Configuration
The `tests/integration/conftest.py` file provides the following (two of these pieces are sketched after the list):
- Comprehensive Fixture Library - All fixtures needed for different test phases
- Custom Test Markers - Foundation, core, orchestration, verification phase markers
- Environment Setup - Automatic test environment configuration
- Parametrized Testing - Multiple scenarios for workflow, load, and context testing
- Error Simulation - Configuration for testing error handling scenarios
- Performance Monitoring - Integration with performance testing infrastructure
- Automatic Cleanup - Session and test-level cleanup management
- Skip Conditions - Environment-specific test skipping (performance, stress, load tests)
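As a flavour of what the environment setup and skip conditions might look like in `conftest.py`, here is a hedged sketch. The `RUN_STRESS_TESTS` variable is hypothetical; `DEVOPS_AGENT_TESTING` appears in the environment setup later in this guide:

```python
import os

import pytest

@pytest.fixture(autouse=True, scope="session")
def _test_environment():
    """Automatic environment setup and teardown for the whole session."""
    os.environ["DEVOPS_AGENT_TESTING"] = "true"
    yield
    os.environ.pop("DEVOPS_AGENT_TESTING", None)

# Environment-specific skip condition for stress tests (hypothetical variable).
requires_stress = pytest.mark.skipif(
    os.environ.get("RUN_STRESS_TESTS") != "true",
    reason="set RUN_STRESS_TESTS=true to enable stress tests",
)
```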
Advanced Mocking
The test suite includes sophisticated mocking for the following (a sample mock fixture is sketched after the list):
- LLM Clients - Realistic response simulation
- Session States - Multi-agent state management
- Test Workspaces - Isolated test environments
- Tool Execution - Comprehensive tool behavior simulation
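For example, the `mock_llm_client` fixture used throughout this guide might be built roughly like this; the actual implementation lives in `tests/fixtures/test_helpers.py` and may differ:

```python
from unittest.mock import AsyncMock, MagicMock

import pytest

@pytest.fixture
def mock_llm_client():
    """LLM client double that returns a canned, realistic-looking response."""
    client = MagicMock()
    client.generate = AsyncMock(return_value={
        "content": "The code has a potential division-by-zero issue.",
        "usage": {"prompt_tokens": 42, "completion_tokens": 17},
    })
    return client
```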
Performance Monitoring
Built-in performance monitoring tracks the following; a minimal stand-in monitor is sketched below the list:
- Memory Usage - Real-time memory consumption
- CPU Usage - Processor utilization
- Token Counting - Token processing performance
- Context Assembly - Context generation timing
- Throughput - Operations per second
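A minimal stand-in for such a monitor, using only the standard library and matching the `execution_time` / `peak_memory_mb` attributes asserted in the fixture examples later in this guide:

```python
import time
import tracemalloc
from dataclasses import dataclass

@dataclass
class PerfMetrics:
    execution_time: float
    peak_memory_mb: float

class SimplePerformanceMonitor:
    """Tracks wall-clock duration and peak memory between start/stop calls."""

    def start_monitoring(self) -> None:
        tracemalloc.start()
        self._start = time.perf_counter()

    def stop_monitoring(self) -> PerfMetrics:
        elapsed = time.perf_counter() - self._start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) bytes
        tracemalloc.stop()
        return PerfMetrics(execution_time=elapsed,
                           peak_memory_mb=peak / (1024 * 1024))
```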
Error Handling Validation
Comprehensive error scenario testing covers the following (see the retry/fallback sketch after the list):
- Recovery Mechanisms - Automatic error recovery
- Retry Logic - Configurable retry strategies
- Fallback Behavior - Graceful degradation
- State Consistency - Error state management
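Conceptually, the recovery behaviour under test resembles the following retry-with-fallback loop; attempt counts and delays are illustrative:

```python
import asyncio

async def execute_with_recovery(tool, args, max_retries: int = 3,
                                base_delay: float = 0.5):
    """Retry a tool call with exponential backoff, then fall back gracefully."""
    for attempt in range(max_retries):
        try:
            return await tool(**args)
        except Exception:
            if attempt == max_retries - 1:
                # Fallback: degrade gracefully instead of crashing the workflow.
                return {"status": "FALLBACK", "result": None}
            await asyncio.sleep(base_delay * 2 ** attempt)
```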
Load Testing
Realistic load testing capabilities include the following (a concurrency sketch follows the list):
- Concurrent Users - Multiple simultaneous sessions
- Resource Monitoring - System resource tracking
- Throughput Testing - Performance under load
- Scalability Analysis - System capacity evaluation
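At its core, this kind of simulation fans out user sessions with `asyncio.gather`; a sketch, where `agent.process_message` follows the mock-agent interface shown later in this guide:

```python
import asyncio
import time

async def run_session(agent, user_id: int) -> bool:
    """One simulated user conversation; returns True on success."""
    response = await agent.process_message(f"user {user_id}: analyze my code")
    return bool(response.get("success"))

async def load_test(agent, concurrent_users: int = 25) -> dict:
    start = time.perf_counter()
    results = await asyncio.gather(
        *(run_session(agent, uid) for uid in range(concurrent_users)))
    elapsed = time.perf_counter() - start
    return {
        "users": concurrent_users,
        "success_rate": sum(results) / len(results),
        "throughput_ops_per_s": concurrent_users / elapsed,
    }
```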
Using Conftest.py Fixtures
The integration test suite includes comprehensive fixtures for all testing scenarios. Here’s how to use them:
Basic Fixture Usage
```python
import pytest

@pytest.mark.integration
@pytest.mark.foundation
class TestMyFeature:
    def test_with_basic_fixtures(self, mock_llm_client, mock_session_state, test_workspace):
        # Use mock LLM client
        assert mock_llm_client is not None
        assert hasattr(mock_llm_client, 'generate')

        # Use mock session state
        assert 'agent_coordination' in mock_session_state
        assert 'context_state' in mock_session_state

        # Use test workspace
        assert 'workspace_dir' in test_workspace
```
Context Management Fixtures
```python
import pytest

@pytest.mark.integration
@pytest.mark.core
class TestContextManagement:
    def test_context_features(self, mock_context_manager, mock_smart_prioritizer, mock_rag_system):
        # Add content to context
        mock_context_manager.add_code_snippet("test.py", "print('hello')")
        mock_context_manager.add_tool_result("test_tool", {"result": "success"})

        # Assemble context
        context, token_count = mock_context_manager.assemble_context(10000)
        assert context is not None
        assert token_count > 0

        # Use smart prioritizer
        snippets = [{"content": "test code", "file_path": "test.py"}]
        prioritized = mock_smart_prioritizer.prioritize_code_snippets(snippets, "test context")
        assert len(prioritized) == 1

        # Use RAG system
        rag_results = mock_rag_system.query("test query", top_k=3)
        assert len(rag_results) == 3
```
Agent Fixtures
```python
import pytest

@pytest.mark.integration
@pytest.mark.foundation
class TestAgents:
    def test_agent_types(self, mock_devops_agent, mock_software_engineer_agent, mock_swe_agent):
        agents = [mock_devops_agent, mock_software_engineer_agent, mock_swe_agent]
        for agent in agents:
            assert hasattr(agent, 'name')
            assert hasattr(agent, 'context_manager')
            assert hasattr(agent, 'process_message')
```
Async Fixtures
```python
import pytest

@pytest.mark.integration
@pytest.mark.asyncio
class TestAsyncOperations:
    async def test_workflow_execution(self, mock_workflow_engine, mock_tool_orchestrator):
        # Test workflow engine
        result = await mock_workflow_engine.execute_workflow(
            "test_workflow",
            ["agent1", "agent2"],
            {"config": "test"}
        )
        assert result["success"] is True

        # Test tool orchestrator
        tool_result = await mock_tool_orchestrator.execute_tool(
            "test_tool",
            {"arg1": "value1"},
            tool_id="test_tool_1"
        )
        assert tool_result.status == "COMPLETED"
```
Performance Fixtures
```python
import pytest

@pytest.mark.integration
@pytest.mark.performance
class TestPerformance:
    def test_monitoring(self, mock_performance_monitor, test_metrics_collector):
        # Start monitoring
        mock_performance_monitor.start_monitoring()

        # Record metrics
        test_metrics_collector.record_metric("execution_time", 1.5, {"test": "example"})

        # Stop monitoring
        metrics = mock_performance_monitor.stop_monitoring()
        assert hasattr(metrics, 'execution_time')
        assert hasattr(metrics, 'peak_memory_mb')
```
Parametrized Fixtures
```python
import pytest

@pytest.mark.integration
@pytest.mark.core
class TestParametrizedScenarios:
    def test_workflow_scenarios(self, workflow_scenario):
        # Automatically tests with different workflow configurations
        assert workflow_scenario['workflow_type'] in ["sequential", "parallel", "iterative", "human_in_loop"]
        assert workflow_scenario['agent_count'] > 0

    def test_context_scenarios(self, context_scenario):
        # Automatically tests with different context sizes
        assert context_scenario['context_size'] > 0
        assert context_scenario['token_limit'] > context_scenario['context_size']
```
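For reference, `workflow_scenario` is presumably defined as a parametrized fixture in `conftest.py`; a hypothetical definition consistent with the assertions above:

```python
import pytest

@pytest.fixture(params=[
    {"workflow_type": "sequential", "agent_count": 2},
    {"workflow_type": "parallel", "agent_count": 3},
    {"workflow_type": "iterative", "agent_count": 2},
    {"workflow_type": "human_in_loop", "agent_count": 1},
])
def workflow_scenario(request):
    """Each test using this fixture runs once per workflow configuration."""
    return request.param
```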
Phase-Specific Fixtures
```python
import pytest

@pytest.mark.integration
@pytest.mark.foundation
class TestFoundationPhase:
    def test_foundation_setup(self, foundation_test_setup):
        # Complete foundation test setup
        assert 'context_manager' in foundation_test_setup
        assert 'agent_pool' in foundation_test_setup
        assert 'workflow_configs' in foundation_test_setup

@pytest.mark.integration
@pytest.mark.core
class TestCorePhase:
    def test_core_setup(self, core_integration_setup):
        # Complete core integration test setup
        assert 'smart_prioritizer' in core_integration_setup
        assert 'cross_turn_correlator' in core_integration_setup
        assert 'rag_system' in core_integration_setup
```
Test Markers
Use these markers to categorize and run specific test types:
```python
@pytest.mark.integration     # All integration tests
@pytest.mark.foundation      # Foundation phase tests
@pytest.mark.core            # Core integration phase tests
@pytest.mark.orchestration   # Tool orchestration phase tests
@pytest.mark.verification    # Performance verification phase tests
@pytest.mark.performance     # Performance tests
@pytest.mark.slow            # Slow tests (>5 seconds)
@pytest.mark.stress          # Stress tests
@pytest.mark.load            # Load tests
```
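pytest warns about unknown marks unless custom markers are registered. One standard way to do this is the `pytest_configure` hook in `conftest.py` (the suite may equally register them in `pyproject.toml`):

```python
def pytest_configure(config):
    """Register the custom markers used by the integration suite."""
    for marker, description in [
        ("integration", "all integration tests"),
        ("foundation", "foundation phase tests"),
        ("core", "core integration phase tests"),
        ("orchestration", "tool orchestration phase tests"),
        ("verification", "performance verification phase tests"),
        ("performance", "performance tests"),
        ("slow", "slow tests (>5 seconds)"),
        ("stress", "stress tests"),
        ("load", "load tests"),
    ]:
        config.addinivalue_line("markers", f"{marker}: {description}")
```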
Example: Complete Integration Test
```python
import pytest

@pytest.mark.integration
@pytest.mark.foundation
class TestCompleteScenario:
    @pytest.mark.asyncio
    async def test_complete_workflow(
        self,
        mock_devops_agent,
        mock_context_manager,
        mock_workflow_engine,
        mock_performance_monitor,
        test_metrics_collector
    ):
        # Start performance monitoring
        mock_performance_monitor.start_monitoring()
        test_metrics_collector.record_metric("test_start", 1.0, {"phase": "setup"})

        # Setup context
        mock_context_manager.add_code_snippet("main.py", "def main(): pass")
        mock_context_manager.add_tool_result("analyze_code", {"complexity": "low"})

        # Start conversation turn
        turn_id = mock_context_manager.start_new_turn("Fix the code issues")

        # Process message through agent
        response = await mock_devops_agent.process_message("Fix the code issues")
        assert response["success"] is True

        # Execute workflow
        workflow_result = await mock_workflow_engine.execute_workflow(
            "fix_issues",
            [mock_devops_agent],
            {"priority": "high"}
        )
        assert workflow_result["success"] is True

        # Stop performance monitoring
        performance_metrics = mock_performance_monitor.stop_monitoring()
        test_metrics_collector.record_metric(
            "test_execution_time",
            performance_metrics.execution_time,
            {"test": "complete_workflow"}
        )

        # Verify final state
        context, token_count = mock_context_manager.assemble_context(50000)
        assert len(context["conversation_history"]) == 1
        assert len(context["code_snippets"]) == 1
        assert len(context["tool_results"]) == 1
```
Choosing Between Traditional and Evaluation Testing
When to Use Traditional Integration Tests
Best for:
- Component Integration - Testing how system components work together
- Error Handling - Validating recovery mechanisms and fallback behavior
- Performance Testing - Load testing, stress testing, and optimization validation
- Infrastructure Testing - Database connections, external API integrations
- Mock-Heavy Scenarios - Testing with controlled, predictable conditions
Example Use Cases:
```python
# Test system performance under load
def test_high_load_performance(self, mock_concurrent_users):
    # Traditional approach excels at controlled performance testing
    ...

# Test error recovery mechanisms
def test_database_connection_failure_recovery(self, mock_db_failure):
    # Traditional approach better for infrastructure failure simulation
    ...
```
When to Use ADK Evaluation Tests
Best for:
- Behavioral Validation - Testing how agents actually respond to real queries
- Tool Usage Patterns - Ensuring agents use tools correctly in context
- Multi-Agent Coordination - Validating agent communication and workflow
- User Experience Testing - Testing scenarios that mirror actual user interactions
- Response Quality - Validating the quality and relevance of agent responses
Example Use Cases:
```jsonc
// Test real agent behavior with natural language
{
  "query": "Help me optimize this slow database query",
  "expected_behavior": "Should analyze query, identify bottlenecks, suggest optimizations"
}

// Test multi-agent coordination
{
  "query": "Coordinate a code review with the team",
  "expected_coordination": "Should orchestrate workflow between multiple agents"
}
```
Combined Testing Strategy
Optimal Approach:
- Use Traditional Tests for infrastructure, performance, and error handling
- Use Evaluation Tests for behavior, tool usage, and user scenarios
- Combine Both for comprehensive coverage
Example Testing Matrix:
| Test Type | Traditional | Evaluation |
|---|---|---|
| Component Integration | ✅ Primary | ❌ Not suitable |
| Agent Behavior | ⚠️ Limited | ✅ Primary |
| Tool Usage | ⚠️ Structural only | ✅ Behavioral |
| Performance | ✅ Primary | ❌ Not suitable |
| Error Handling | ✅ Primary | ⚠️ Some scenarios |
| User Scenarios | ❌ Not suitable | ✅ Primary |
| Multi-Agent Coordination | ⚠️ Structural only | ✅ Behavioral |
| Response Quality | ❌ Not suitable | ✅ Primary |
Migration Strategy
Gradual Adoption:
- Keep existing traditional tests - They provide valuable infrastructure coverage
- Add evaluation tests for new features and critical user scenarios
- Identify overlaps - Replace structural tests with behavioral tests where appropriate
- Maintain both approaches - Each serves different but complementary purposes
Coverage Goals:
- Traditional Tests - 80%+ code coverage, infrastructure validation
- Evaluation Tests - 100% critical user scenario coverage, behavioral validation
- Combined - Complete confidence in both system reliability and user experience
Best Practices
Test Organization
Traditional Tests:
- Follow the 4-phase structure - Foundation → Core → Tool Orchestration → Performance
- Use descriptive test names - Clear intent and scope
- Implement proper fixtures - Reusable test components
- Mock external dependencies - Isolated test execution
Evaluation Tests:
- Mirror user scenarios - Write queries as real users would
- Focus on behavior - Test what agents do, not just structure
- Use realistic data - Include actual code samples, real-world problems
- Validate tool usage - Ensure agents use tools correctly in context
Performance Considerations
Traditional Tests:
- Use parallel execution - Faster test runs where possible
- Monitor resource usage - Prevent test environment impact
- Set appropriate timeouts - Balance thoroughness with speed
- Profile slow tests - Identify optimization opportunities
Evaluation Tests:
- Cache evaluation results - Avoid redundant scenario execution
- Use realistic timeouts - Account for actual agent thinking time
- Batch similar scenarios - Group related tests for efficiency
- Monitor response quality - Track degradation over time
Maintenance
Traditional Tests:
- Keep tests up-to-date - Sync with system changes
- Review test coverage - Ensure comprehensive validation
- Update mocks regularly - Maintain realistic behavior
- Document new patterns - Share knowledge with team
Evaluation Tests:
- Update scenarios regularly - Keep pace with user needs
- Validate tool expectations - Ensure tool names and parameters are current
- Review response patterns - Update expected behaviors as agents improve
- Maintain scenario diversity - Cover edge cases and new use cases
Quality Assurance
Evaluation Test Quality:
- Write clear queries - Unambiguous user intentions
- Define specific outcomes - Clear success criteria
- Include edge cases - Error scenarios and boundary conditions
- Test coordination patterns - Multi-agent workflows and handoffs
- Validate memory usage - Session continuity and context awareness
Example Quality Checklist:
```json
{
  "scenario_quality_checks": {
    "clear_user_intent": "✅ Query represents realistic user need",
    "specific_tools": "✅ Expected tools are precisely defined",
    "measurable_outcome": "✅ Success criteria are clearly defined",
    "realistic_context": "✅ Scenario reflects actual usage patterns",
    "coordination_tested": "✅ Multi-agent interactions are validated"
  }
}
```
Integration with CI/CD
GitHub Actions Integration
```yaml
name: Integration Tests

on: [push, pull_request]

jobs:
  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: |
          uv sync --dev

      - name: Run traditional integration tests
        run: |
          ./tests/integration/run_integration_tests.py --parallel

      - name: Run ADK evaluation tests
        run: |
          uv run pytest tests/integration/test_adk_evaluation_patterns.py -v

      - name: Validate evaluation test structure
        run: |
          uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_evaluation_test_files_exist -v
```
Performance Monitoring
Set up performance thresholds:
```yaml
- name: Check performance thresholds
  run: |
    python -c "
    import json
    with open('test_reports/integration_test_report_latest.json') as f:
        report = json.load(f)
    avg_duration = report['performance_metrics']['average_test_duration']
    if avg_duration > 2.0:
        raise Exception(f'Average test duration too high: {avg_duration}s')
    print(f'Performance check passed: {avg_duration}s average')
    "
```
Troubleshooting
Common Issues
- Test Timeouts - Increase timeout values or optimize slow tests
- Memory Issues - Check for memory leaks in test setup/teardown
- Mock Failures - Ensure mocks match actual system behavior
- Flaky Tests - Add proper wait conditions and state validation
Debug Mode
Enable verbose logging for detailed debugging:
```bash
./tests/integration/run_integration_tests.py --verbose
```
Environment Issues
Ensure proper test environment setup:
```bash
# Install test dependencies
uv sync --dev

# Set environment variables
export DEVOPS_AGENT_TESTING=true
export DEVOPS_AGENT_LOG_LEVEL=DEBUG
```
Next Steps
Getting Started
- Run the complete test suite to validate your implementation:

```bash
# Run all tests (traditional + evaluation)
uv run pytest tests/ --cov=src --cov-config=pyproject.toml --cov-report=term
```
- Review test reports for performance insights
- Integrate into CI/CD for continuous validation
Expanding Test Coverage
Traditional Tests:
- Add infrastructure tests for new components and integrations
- Expand performance tests for scalability validation
- Include error handling tests for robustness
Evaluation Tests:
- Create user scenario tests for new features
- Add coordination tests for multi-agent workflows
- Include memory tests for session continuity
- Validate tool usage for new tools and capabilities
Sample Evaluation Test Creation
```bash
# Create new evaluation test file
touch tests/integration/evaluation_tests/my_new_feature.evalset.json

# Add test validation
# Edit: tests/integration/test_adk_evaluation_patterns.py
# Add: test_my_new_feature_evaluation()
```
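The added validation method could mirror the existing file-existence tests; a sketch, with the class name taken from the file above and the schema check assumed:

```python
import json
from pathlib import Path

class TestADKEvaluationPatterns:
    def test_my_new_feature_evaluation(self):
        """The new evalset file exists and defines at least one scenario."""
        path = Path("tests/integration/evaluation_tests/my_new_feature.evalset.json")
        assert path.exists(), "evalset file is missing"
        data = json.loads(path.read_text())
        assert data.get("test_scenarios"), "define at least one scenario"
```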
Team Adoption
- Share knowledge with your development team
- Document evaluation patterns for your specific use cases
- Establish review processes for both traditional and evaluation tests
- Monitor test quality and update scenarios regularly
For detailed information about specific test patterns, see the Test Patterns Guide.
For performance testing specifics, see the Performance Testing Guide.
For troubleshooting help, see the Troubleshooting Guide.