ADK Evaluation Framework

The ADK Evaluation Framework provides behavioral testing for multi-agent systems, validating how agents actually behave in real-world scenarios rather than just testing structural components.

Overview

What is Evaluation Testing?

Evaluation testing focuses on agent behavior rather than system structure:

  • Traditional Testing: “Does the agent have the right components?”
  • Evaluation Testing: “Does the agent behave correctly when users interact with it?”

Key Benefits

  • Real Behavior Validation - Tests actual agent responses to natural language queries
  • User-Centric Testing - Scenarios mirror actual user interactions
  • Tool Usage Validation - Ensures agents use tools correctly in context
  • Multi-Agent Coordination - Validates communication and workflow between agents
  • Response Quality Assurance - Tests the quality and relevance of agent responses
  • Future-Proof Integration - Ready for the official ADK evaluation module

Architecture

File Structure

tests/integration/evaluation_tests/
├── simple_code_analysis.evalset.json      # Basic agent functionality
├── sub_agent_delegation.evalset.json      # Agent hierarchy patterns
├── tool_usage.evalset.json                # Tool usage validation
├── multi_agent_coordination.evalset.json  # Coordination patterns
├── agent_memory_persistence.evalset.json  # Memory & persistence
└── test_config.json                       # Configuration

Test Validation

tests/integration/test_adk_evaluation_patterns.py
├── TestADKEvaluationPatterns
│   ├── test_evaluation_test_files_exist()
│   ├── test_evaluation_test_structure()
│   ├── test_multi_agent_coordination_evaluation()
│   ├── test_agent_memory_persistence_evaluation()
│   └── test_adk_evaluation_framework_readiness()
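
The structural checks above can stay lightweight. As an illustration, a file-existence test along the lines of test_evaluation_test_files_exist() only needs to confirm that every expected evalset file is present and parses as JSON. The sketch below mirrors the directory layout shown above; it is illustrative, not the framework's actual implementation.

import json
from pathlib import Path

EVAL_DIR = Path("tests/integration/evaluation_tests")
EXPECTED_FILES = [
    "simple_code_analysis.evalset.json",
    "sub_agent_delegation.evalset.json",
    "tool_usage.evalset.json",
    "multi_agent_coordination.evalset.json",
    "agent_memory_persistence.evalset.json",
]

def test_evaluation_test_files_exist():
    """Every expected evalset file exists and contains valid JSON."""
    for name in EXPECTED_FILES:
        path = EVAL_DIR / name
        assert path.exists(), f"Missing evaluation file: {path}"
        json.loads(path.read_text())  # raises on malformed JSON, failing the test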

Creating Evaluation Tests

Basic Evaluation File Format

{
  "test_name": "Agent Behavior Evaluation",
  "description": "Tests agent responses to real-world scenarios",
  "version": "1.0.0",
  "test_scenarios": [
    {
      "scenario_id": "unique_scenario_identifier",
      "description": "What this scenario tests",
      "query": "Natural language user query",
      "expected_tool_use": [
        {
          "tool_name": "specific_tool_name",
          "inputs": {
            "parameter1": "expected_value",
            "parameter2": "expected_value"
          }
        }
      ],
      "expected_intermediate_agent_responses": [
        {
          "agent_type": "agent_name",
          "response_pattern": "expected_response_content",
          "coordination_actions": ["action1", "action2"]
        }
      ],
      "reference": "Expected outcome and behavior description"
    }
  ],
  "evaluation_criteria": {
    "criterion1": "What this criterion measures",
    "criterion2": "What this criterion measures"
  }
}
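
A simple validator can enforce this format before any behavioral checks run. The helper below is a minimal sketch that asserts the top-level and per-scenario keys described above; it is not part of the framework itself.

import json
from pathlib import Path

REQUIRED_TOP_LEVEL = {"test_name", "description", "version",
                      "test_scenarios", "evaluation_criteria"}
REQUIRED_SCENARIO_KEYS = {"scenario_id", "description", "query",
                          "expected_tool_use", "reference"}

def validate_evalset(path: str) -> None:
    """Raise AssertionError if an evalset file is missing required fields."""
    data = json.loads(Path(path).read_text())
    missing = REQUIRED_TOP_LEVEL - data.keys()
    assert not missing, f"{path}: missing top-level keys {missing}"
    for scenario in data["test_scenarios"]:
        gaps = REQUIRED_SCENARIO_KEYS - scenario.keys()
        assert not gaps, f"{scenario.get('scenario_id', '?')}: missing {gaps}"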

Pattern Examples

1. Simple Code Analysis Pattern

{
  "scenario_id": "basic_code_analysis",
  "description": "Test basic code analysis capabilities",
  "query": "Analyze this Python code for potential issues: def calculate(x, y): return x/y",
  "expected_tool_use": [
    {
      "tool_name": "code_analyzer",
      "inputs": {
        "code": "def calculate(x, y): return x/y",
        "language": "python"
      }
    }
  ],
  "expected_intermediate_agent_responses": [
    {
      "agent_type": "code_quality_agent",
      "response_pattern": "division by zero vulnerability",
      "coordination_actions": ["risk_assessment", "recommendation_generation"]
    }
  ],
  "reference": "Agent should identify division by zero risk and suggest input validation"
}

2. Multi-Agent Coordination Pattern

{
  "scenario_id": "feature_development_coordination",
  "description": "Test coordination between multiple agents for feature development",
  "query": "I need to implement a new user authentication feature. Please coordinate between design, development, and testing teams.",
  "expected_tool_use": [
    {
      "tool_name": "workflow_orchestrator",
      "inputs": {
        "workflow_type": "feature_development",
        "agents_required": ["design_pattern_agent", "code_review_agent", "testing_agent"],
        "coordination_strategy": "sequential_with_feedback"
      }
    }
  ],
  "expected_intermediate_agent_responses": [
    {
      "agent_type": "design_pattern_agent",
      "response_pattern": "authentication architecture design",
      "coordination_actions": ["state_update", "next_agent_notification"]
    },
    {
      "agent_type": "code_review_agent",
      "response_pattern": "implementation review feedback",
      "coordination_actions": ["quality_validation", "testing_handoff"]
    },
    {
      "agent_type": "testing_agent",
      "response_pattern": "test strategy and execution",
      "coordination_actions": ["validation_complete", "workflow_finalization"]
    }
  ],
  "reference": "Workflow should demonstrate proper agent handoffs, state sharing, and collaborative completion"
}

3. Memory & Persistence Pattern

{
  "scenario_id": "session_continuity_test",
  "description": "Test session continuity across multiple interactions",
  "query": "Remember that I'm working on a Flask web application. We discussed implementing user authentication. Now I need to add password reset functionality.",
  "expected_tool_use": [
    {
      "tool_name": "session_memory_manager",
      "inputs": {
        "operation": "retrieve_session_context",
        "session_id": "user_session_123",
        "context_keys": ["project_type", "framework", "previous_features"]
      }
    },
    {
      "tool_name": "persistent_memory_tool",
      "inputs": {
        "operation": "load_memory",
        "memory_type": "project_context",
        "filters": ["flask_authentication", "web_development"]
      }
    }
  ],
  "expected_intermediate_agent_responses": [
    {
      "agent_type": "memory_retrieval_agent",
      "response_pattern": "retrieved context about Flask project and authentication work",
      "coordination_actions": ["context_validation", "memory_integration", "continuity_establishment"]
    },
    {
      "agent_type": "development_agent",
      "response_pattern": "password reset implementation building on previous authentication work",
      "coordination_actions": ["context_utilization", "feature_integration", "knowledge_application"]
    }
  ],
  "reference": "Agent should demonstrate clear continuity from previous conversation, referencing Flask project and authentication implementation"
}
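
Scenario files like these are naturally data-driven, so one way to exercise them is to discover every scenario across every evalset file and hand each one to pytest individually. The parametrization below is a sketch of that idea, not the actual test suite; the discovery helper and test IDs are assumptions.

import json
from pathlib import Path

import pytest

def discover_scenarios():
    """Yield one pytest param per scenario across all evalset files."""
    for path in Path("tests/integration/evaluation_tests").glob("*.evalset.json"):
        data = json.loads(path.read_text())
        for scenario in data.get("test_scenarios", []):
            yield pytest.param(scenario, id=f"{path.stem}:{scenario['scenario_id']}")

@pytest.mark.parametrize("scenario", discover_scenarios())
def test_scenario_is_well_formed(scenario):
    # Every scenario needs at least a natural-language query and a reference outcome
    assert scenario["query"].strip()
    assert scenario["reference"].strip()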

Running Evaluation Tests

Command Line Usage

# Run all evaluation tests
uv run pytest tests/integration/test_adk_evaluation_patterns.py -v

# Run specific evaluation categories
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_multi_agent_coordination_evaluation -v

# Run with integration suite
./tests/integration/run_integration_tests.py --suite "ADK Evaluation"

# Validate test files structure
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_evaluation_test_files_exist -v

Integration Test Runner

# Run complete test suite (traditional + evaluation)
./tests/integration/run_integration_tests.py

# Run only evaluation tests
./tests/integration/run_integration_tests.py --suite "Multi-Agent Coordination Evaluation Tests"
./tests/integration/run_integration_tests.py --suite "Agent Memory and Persistence Evaluation Tests"

Test Configuration

Configuration File (test_config.json)

{
  "criteria": {
    "tool_trajectory_avg_score": 0.8,
    "response_match_score": 0.7
  },
  "evaluation_settings": {
    "timeout_seconds": 30,
    "max_retries": 3,
    "parallel_execution": true,
    "cache_results": true
  },
  "quality_thresholds": {
    "minimum_scenarios_per_file": 3,
    "maximum_scenario_runtime": 60,
    "required_coverage_patterns": [
      "simple_analysis",
      "multi_agent_coordination",
      "memory_persistence"
    ]
  }
}
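
The criteria values are scoring thresholds rather than pass/fail flags on their own. For example, a tool-trajectory check can compare the tools an agent actually invoked against a scenario's expected_tool_use and require the match ratio to reach tool_trajectory_avg_score. The scoring function below is a deliberately simplified sketch (ordered name matching only), not the framework's real metric.

import json

def tool_trajectory_score(expected_tool_use, actual_tool_calls):
    """Fraction of expected tools found, in order, in the actual call trace."""
    if not expected_tool_use:
        return 1.0
    actual_names = [call["tool_name"] for call in actual_tool_calls]
    matched, cursor = 0, 0
    for step in expected_tool_use:
        if step["tool_name"] in actual_names[cursor:]:
            cursor = actual_names.index(step["tool_name"], cursor) + 1
            matched += 1
    return matched / len(expected_tool_use)

# Compare against the configured threshold (assumes test_config.json is on disk)
config = json.loads(open("tests/integration/evaluation_tests/test_config.json").read())
threshold = config["criteria"]["tool_trajectory_avg_score"]  # 0.8 in the example above
score = tool_trajectory_score(
    [{"tool_name": "code_analyzer"}],
    [{"tool_name": "code_analyzer", "inputs": {"language": "python"}}],
)
assert score >= threshold  # 1.0 >= 0.8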

Best Practices

Writing Effective Scenarios

1. Natural Language Queries

// Good: Natural, realistic user query
{
  "query": "Help me optimize this slow database query that's affecting our user dashboard"
}

// Bad: Technical, unrealistic query
{
  "query": "Execute SQL optimization algorithm on provided query string"
}

2. Specific Tool Expectations

// Good: Specific tool with clear parameters
{
  "tool_name": "database_query_analyzer",
  "inputs": {
    "query": "SELECT * FROM users WHERE active = 1",
    "database_type": "postgresql"
  }
}

// Bad: Vague tool expectation
{
  "tool_name": "optimizer",
  "inputs": {
    "data": "some_query"
  }
}

3. Realistic Response Patterns

// Good: Specific, measurable response pattern
{
  "response_pattern": "query performance analysis with specific bottleneck identification",
  "coordination_actions": ["performance_measurement", "optimization_recommendations"]
}

// Bad: Vague response expectation
{
  "response_pattern": "good response",
  "coordination_actions": ["does_something"]
}
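
Since response_pattern describes expected content rather than an exact string, a practical matcher scores it loosely, for example by keyword overlap, and compares the result against response_match_score from test_config.json. The matcher below is a toy sketch of that idea; the framework's real scoring may be more sophisticated.

import re

def response_match_score(response_pattern: str, actual_response: str) -> float:
    """Fraction of keywords from the expected pattern found in the actual response."""
    keywords = re.findall(r"[a-z]+", response_pattern.lower())
    if not keywords:
        return 0.0
    hits = sum(1 for word in keywords if word in actual_response.lower())
    return hits / len(keywords)

# A specific pattern only scores well when the response actually covers it
score = response_match_score(
    "query performance analysis with specific bottleneck identification",
    "Performance analysis identified a specific bottleneck: the full table scan; "
    "rewriting the query with an index on users.active is recommended.",
)
assert score >= 0.7  # the response_match_score threshold from test_config.json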

Scenario Categories

Core Functionality Tests

  • Basic agent capabilities - Simple queries and responses
  • Tool usage validation - Correct tool selection and parameter passing
  • Error handling - Graceful handling of edge cases

Multi-Agent Coordination Tests

  • Workflow orchestration - Sequential and parallel agent coordination
  • State sharing - Consistent state management across agents
  • Conflict resolution - Handling conflicting agent recommendations

Memory & Persistence Tests

  • Session continuity - Maintaining context across interactions
  • Cross-conversation memory - Retaining information between sessions
  • Knowledge evolution - Learning and improving over time

Quality Assurance

Scenario Quality Checklist

  • Clear User Intent - Query represents realistic user need
  • Specific Tools - Expected tools are precisely defined
  • Measurable Outcome - Success criteria are clearly defined
  • Realistic Context - Scenario reflects actual usage patterns
  • Coordination Tested - Multi-agent interactions are validated

Common Pitfalls

  1. Overly Technical Queries - Use natural language, not technical jargon
  2. Vague Expectations - Be specific about expected tools and responses
  3. Unrealistic Scenarios - Ensure scenarios mirror actual user interactions
  4. Missing Edge Cases - Include error conditions and boundary cases
  5. Incomplete Coordination - Test full agent workflows, not just individual responses

Maintenance

Regular Updates

Monthly Reviews

  1. Validate tool names - Ensure tool expectations match current implementations
  2. Update response patterns - Reflect improvements in agent capabilities
  3. Add new scenarios - Cover new features and user patterns
  4. Review test results - Identify degradation or improvement trends

Quarterly Assessments

  1. Scenario coverage analysis - Identify gaps in test coverage
  2. Performance benchmarking - Track response quality over time
  3. User feedback integration - Add scenarios based on user issues
  4. Tool usage analysis - Ensure all critical tools are tested

Version Control

Semantic Versioning

  • Major versions - Significant changes to evaluation framework
  • Minor versions - New evaluation scenarios or patterns
  • Patch versions - Bug fixes and minor updates

Change Documentation

{
  "version": "1.2.0",
  "changelog": {
    "added": ["multi_agent_coordination scenarios", "memory_persistence patterns"],
    "modified": ["tool_usage validation criteria"],
    "deprecated": ["legacy_analysis_pattern"],
    "removed": ["outdated_coordination_scenario"]
  }
}

Future Roadmap

Official ADK Integration

  • Evaluation Module - Integration with the official ADK evaluation module when it becomes available
  • Standardized Metrics - Adoption of the official ADK evaluation metrics
  • Automated Scoring - Integration with ADK scoring mechanisms

Advanced Features

  • Dynamic Scenario Generation - AI-generated test scenarios
  • Continuous Learning - Self-improving evaluation criteria
  • Performance Benchmarking - Automated performance comparison
  • Quality Regression Detection - Automatic detection of capability degradation

Community Contribution

  • Scenario Sharing - Community-contributed evaluation scenarios
  • Pattern Libraries - Reusable evaluation patterns
  • Best Practice Documentation - Community-driven best practices

Troubleshooting

Common Issues

Test Files Not Found

# Check file existence
ls tests/integration/evaluation_tests/

# Validate JSON structure
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_evaluation_test_files_exist -v

Invalid JSON Format

# Validate JSON syntax
python -m json.tool tests/integration/evaluation_tests/my_test.evalset.json

# Check test structure
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_evaluation_test_structure -v

Tool Name Mismatches

# Validate tool names
uv run pytest tests/integration/test_adk_evaluation_patterns.py::TestADKEvaluationPatterns::test_tool_names_match_agent_tools -v
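
A mismatch usually means an evalset references a tool name that no agent actually registers. A quick standalone snippet (not part of the test suite) can list every tool name the evalsets expect, ready to diff against the agents' registered tools:

import json
from pathlib import Path

# Collect every tool name referenced by expected_tool_use across all evalset files
expected = set()
for path in Path("tests/integration/evaluation_tests").glob("*.evalset.json"):
    data = json.loads(path.read_text())
    for scenario in data.get("test_scenarios", []):
        for step in scenario.get("expected_tool_use", []):
            expected.add(step["tool_name"])

print(sorted(expected))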

Debug Mode

Enable verbose logging:

# Run with debug output
uv run pytest tests/integration/test_adk_evaluation_patterns.py -v -s --log-cli-level=DEBUG