# Agent Evaluation Framework
The Agent Evaluation Framework provides tools for measuring and improving agent performance, ensuring consistent quality across different use cases.
## Current Status
Status: Phase 1 Implemented
Phase 1 of the custom AgentDock Core Evaluation Framework has been implemented. This includes the core runner, evaluator interface, storage provider concept, and a suite of initial evaluators (RuleBased, LLMJudge, NLPAccuracy, ToolUsage, Lexical Suite).
## Overview
The framework offers:
- Extensible Architecture: Based on a core `Evaluator` interface.
- Suite of Built-in Evaluators: Covering rule-based checks, LLM-as-judge, semantic similarity, tool usage, and lexical analysis.
- Configurable Runs: Using `EvaluationRunConfig` to select evaluators and criteria (see the configuration sketch after this list).
- Aggregated Results: Providing detailed outputs with scores, reasoning, and metadata.
- Optional Persistence: Basic file-based logging (`JsonFileStorageProvider`) implemented, with potential for future integration with a Storage Abstraction Layer.
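As a rough picture of how these pieces fit together, the sketch below defines two criteria and a run configuration. The names `EvaluationCriteria`, `EvaluationRunConfig`, and `JsonFileStorageProvider` come from this document, but the import path, field names, and evaluator config shapes are illustrative assumptions; consult the Evaluation Framework PRD and the AgentDock Core source for the actual API.

```typescript
// Illustrative sketch: the names below come from this document, but the import
// path, field names, and evaluator config shapes are assumptions, not the
// exact agentdock-core API. See the Evaluation Framework PRD for the real shapes.
import {
  JsonFileStorageProvider,
  type EvaluationCriteria,
  type EvaluationRunConfig
} from 'agentdock-core';

// Criteria carry a name, description, scale, and optional weight.
const criteria: EvaluationCriteria[] = [
  { name: 'Conciseness', description: 'Stays within the length budget.', scale: 'binary', weight: 0.3 },
  { name: 'Helpfulness', description: 'Addresses the user request.', scale: 'likert5', weight: 0.7 }
];

// The run config selects which evaluators run, their options, and optional persistence.
const runConfig: EvaluationRunConfig = {
  evaluatorConfigs: [
    { type: 'RuleBased' /* e.g. length, includes, regex, JSON-validity rules */ },
    { type: 'LLMJudge' /* judge model and prompt template options */ }
  ],
  // Optional JSONL logging of results (server-side only); constructor shape is assumed.
  storageProvider: new JsonFileStorageProvider('./evaluation-results.jsonl'),
  metadata: { agentId: 'support-agent' }
};
```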
## Architecture (Phase 1 Implementation)
### Implementation Options
A Custom Implementation within AgentDock Core was chosen and developed for Phase 1. This provides:
- Full control over the evaluation process.
- Tight integration with AgentDock types (`AgentMessage`, etc.).
- Specific evaluators tailored to agent use cases (e.g., `ToolUsageEvaluator`).
- An extensible base for future enhancements.
Third-party integrations were deferred to allow for a bespoke foundation matching AgentDock's architecture.
### Key Components (Phase 1)
- `EvaluationInput`: Data packet including response, prompt, history, ground truth, context, criteria (illustrative type shapes are sketched after this list).
- `EvaluationCriteria`: Defines metrics with name, description, scale, and optional weight.
- `Evaluator` interface: Core extensibility point (`type`, `evaluate` method).
- `EvaluationResult`: Output per criterion (score, reasoning, type).
- `EvaluationRunConfig`: Specifies evaluators, their configs, optional storage provider, metadata.
- `EvaluationRunner`: Orchestrates the run via the `runEvaluation` function.
- `AggregatedEvaluationResult`: Final combined output with overall score (if applicable), individual results, snapshots.
- `JsonFileStorageProvider`: Basic implementation for server-side result logging.
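To make the data flow concrete, the following TypeScript sketch shows plausible shapes for these types, inferred from the descriptions above. Field names are illustrative and the actual definitions in AgentDock Core may differ.

```typescript
// Illustrative shapes only, inferred from the component descriptions above;
// the actual definitions in AgentDock Core may use different field names.
interface EvaluationCriteria {
  name: string;
  description: string;
  scale: string;                        // e.g. 'binary', 'likert5', 'numeric'
  weight?: number;                      // used for weighted aggregation
}

interface EvaluationInput {
  response: string;                     // agent output under test
  prompt?: string;
  messageHistory?: unknown[];           // AgentMessage[] in practice
  groundTruth?: string;
  context?: Record<string, unknown>;
  criteria: EvaluationCriteria[];
}

interface EvaluationResult {
  criterionName: string;
  score: number | boolean | string;
  reasoning?: string;
  evaluatorType: string;                // which evaluator produced the result
}

interface Evaluator {
  type: string;
  evaluate(input: EvaluationInput, criteria: EvaluationCriteria[]): Promise<EvaluationResult[]>;
}

interface EvaluationRunConfig {
  evaluatorConfigs: Array<{ type: string; [option: string]: unknown }>;
  storageProvider?: unknown;            // e.g. a JsonFileStorageProvider instance
  metadata?: Record<string, unknown>;
}

interface AggregatedEvaluationResult {
  overallScore?: number;                // present when scores could be normalized
  results: EvaluationResult[];
  inputSnapshot?: EvaluationInput;      // snapshot of what was evaluated
}
```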
## Key Features (Phase 1)
- Rule-Based Checks: Length, includes, regex, JSON validity.
- LLM-as-Judge: Qualitative assessment via LLM call with templating.
- Semantic Similarity: Cosine similarity using pluggable embedding models (default provided).
- Tool Usage Validation: Checks tool calls, arguments against expectations.
- Lexical Analysis: Similarity (Levenshtein, Dice, etc.), keyword coverage, sentiment (VADER), toxicity (blocklist).
- Flexible Input Sourcing: Evaluators can pull text from `response`, `prompt`, `groundTruth`, or nested `context` fields.
- Score Normalization & Aggregation: The runner attempts to normalize scores to 0-1 and calculate a weighted average (see the sketch after this list).
- Basic Persistence: Optional JSONL file logging.
- Comprehensive Unit Tests: Added for core components and evaluators.
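The normalization and aggregation behaviour can be pictured with a small standalone sketch (simplified, not the runner's actual code): boolean results map to 0 or 1, numeric results are scaled into the 0-1 range against their criterion's scale, and the overall score is a weighted average.

```typescript
// Simplified sketch of score normalization and weighted aggregation,
// not the runner's actual implementation.
function normalizeScore(raw: number | boolean, min = 0, max = 1): number {
  if (typeof raw === 'boolean') return raw ? 1 : 0;   // binary checks map to 0/1
  if (max === min) return 0;
  const clamped = Math.min(Math.max(raw, min), max);  // clamp into the criterion's range
  return (clamped - min) / (max - min);               // scale into 0-1
}

// Weighted average of normalized per-criterion scores; weights default to 1 when omitted.
function aggregate(scores: { normalized: number; weight?: number }[]): number | undefined {
  if (scores.length === 0) return undefined;
  const totalWeight = scores.reduce((sum, s) => sum + (s.weight ?? 1), 0);
  if (totalWeight === 0) return undefined;
  return scores.reduce((sum, s) => sum + s.normalized * (s.weight ?? 1), 0) / totalWeight;
}

// Example: a binary pass plus a 4-out-of-5 judge score weighted more heavily.
const overall = aggregate([
  { normalized: normalizeScore(true), weight: 0.3 },      // 1
  { normalized: normalizeScore(4, 1, 5), weight: 0.7 }    // 0.75
]);
console.log(overall); // 0.3 * 1 + 0.7 * 0.75 = 0.825
```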
## Benefits (Achieved in Phase 1)
- Foundational Quality Assurance: Basic framework for consistent checks.
- Extensible Base: Custom evaluators can be built.
- Initial Benchmarking: Enables comparison of runs via results.
- Concrete Metrics: Moves beyond subjective assessment for core areas.
## Timeline
| Phase | Status | Description |
|---|---|---|
| Design | Completed (Phase 1) | Phase 1 architecture designed and implemented. |
| Core Implementation | Completed (Phase 1) | Basic framework, runner, interface, initial evaluators, storage provider implemented. |
| Phase 2 / Advanced Features | Planned | See PRD for details (e.g., advanced evaluator configs, UI integration, enhanced storage). |
## Use Cases
### Agent Development
Apply evaluations during development to iteratively improve quality:
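A minimal evaluation step in such a loop might look like the sketch below, reusing the criteria and run config shown in the Overview sketch. The import path, `runEvaluation` signature, and result field names are assumptions for illustration, not the exact AgentDock Core API.

```typescript
// Sketch only: the import path, runEvaluation signature, and result field names
// are assumptions for illustration, not the exact agentdock-core API.
import { runEvaluation, type EvaluationCriteria, type EvaluationRunConfig } from 'agentdock-core';

async function evaluateAgentTurn(
  prompt: string,
  response: string,
  criteria: EvaluationCriteria[],
  runConfig: EvaluationRunConfig
) {
  // Assemble the evaluation input from a single agent exchange.
  const input = {
    prompt,
    response,
    context: { scenario: 'billing-faq' }, // optional nested context, available to evaluators
    criteria
  };

  // Run the configured evaluators; results can also be logged as JSONL
  // when a storage provider is set on the run config.
  const aggregated = await runEvaluation(input, runConfig);

  // Inspect the aggregated output: overall score plus per-criterion results.
  console.log('Overall score:', aggregated.overallScore);
  for (const result of aggregated.results) {
    console.log(`${result.criterionName}: ${result.score} (${result.reasoning ?? 'no reasoning provided'})`);
  }
  return aggregated;
}
```

Comparing the logged `AggregatedEvaluationResult` records across runs is what enables the run-to-run comparison described under Benefits.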
The implemented Phase 1 framework provides the core capabilities for this loop. Refer to the Evaluation Framework PRD for detailed usage and Phase 2 plans.