PRD: AgentDock Evaluation Framework - Building for Measurable Quality
1. Introduction: The Need for Standardized Evaluation
As AgentDock evolves, the ability to systematically measure and improve agent quality becomes paramount. While various ad-hoc methods and external tools have been used previously, a standardized, integrated Evaluation Framework within AgentDock Core is essential to:
- Ensure Consistency: Provide a common yardstick for quality across all agents and development cycles.
- Enable Systematic Improvement: Establish a data-driven feedback loop to identify weaknesses and track progress effectively.
- Facilitate Benchmarking: Reliably compare the performance of different models, prompts, or agent versions.
- Guarantee Production Readiness: Implement objective quality gates before deploying agents into production environments.
Building this framework is a foundational step towards delivering robust, reliable, and continuously improving AI agents. (Ref: GitHub Issue #105).
2. The Goal: A Foundational, Adaptable Evaluation Core
The objective is clear: Implement a modular and extensible Evaluation Framework within `agentdock-core`. We are not building every possible metric upfront. The focus is squarely on the foundational architecture: interfaces, data structures, execution logic, and integration points. This foundation must allow developers to:
- Define diverse evaluation criteria specific to their needs.
- Implement various evaluation methods (`Evaluators`) using a standard contract.
- Execute evaluations systematically.
- Aggregate and store results for analysis.
- Critically: Integrate external tools or custom logic without modifying the core framework. Adaptability is a fundamental design principle, not an afterthought.
The goal is straightforward: Give developers the tools to systematically measure and improve agent quality using methods appropriate for their specific constraints. It needs to function both as an in-process library and be suitable for wrapping in a service layer later.
Intended Use Cases (Beyond Simple Invocation)
This framework must support standard development workflows:
- CI/CD Integration: Run evaluation suites automatically on code/model/prompt changes to catch regressions.
- Benchmarking: Systematically compare agent versions, LLMs, or prompts against standard datasets/criteria.
- Observability Integration: Feed structured evaluation results (scores, metadata, failures) into monitoring/tracing systems (e.g., OpenTelemetry).
- Production Monitoring: Allow periodic evaluation runs on live traffic samples (potentially with different criteria than CI).
3. Scope: What We're Building Now (and What We're Not)
Focus is essential. We build the core infrastructure first.
In Scope (Phase 1 - Largely Completed):
- 🟢 Core Architecture: Defined TypeScript interfaces (`EvaluationInput`, `EvaluationResult`, `EvaluationCriteria`, `AggregatedEvaluationResult`, `Evaluator`, `EvaluationStorageProvider`). These contracts are non-negotiable.
- 🟢 Evaluation Runner: Implemented the `EvaluationRunner` orchestrator, including score normalization and weighted aggregation.
- 🟢 Initial Evaluators: Provided essential building blocks:
  - `RuleBasedEvaluator`: For simple, fast, deterministic checks (e.g., keyword presence, length, includes, JSON parsing). Cheap, essential guardrails.
  - `LLMJudgeEvaluator`: Uses a configurable `CoreLLM` (via Vercel AI SDK) for nuanced quality assessment. Expensive but necessary for subjective measures.
- 🟢 Criteria Definition: Mechanism to define/manage `EvaluationCriteria` sets programmatically is in place.
- 🟢 Result Aggregation: Implemented weighted averaging with score normalization (0-1 range where applicable) in the runner.
- 🟢 Storage Interface & Basic Implementation: Defined the `EvaluationStorageProvider` interface. Provided `JsonFileStorageProvider`, which appends JSON to a local file.
- 🟢 Core Integration Points & Example: The `runEvaluation` function is established, and `run_evaluation_example.ts` demonstrates its usage.
- 🟡 Unit Tests: Foundational tests for core types and some components exist, but comprehensive coverage for all evaluators, runner logic, and storage provider contracts is still pending.
Out of Scope (Initial Version - No Change):
- UI/Dashboard: No frontend visualization. Focus is on the backend engine.
- Dedicated Scalable Database Backend: Default file storage is for utility. Robust storage (PostgreSQL, MLOps DBs) requires separate `EvaluationStorageProvider` implementations later.
- Dedicated HTTP Service Layer: Design must allow wrapping in a service, but building that service is out of scope for Phase 1.
- Specific 3rd-Party Tool Wrappers: Won't build wrappers for DeepEval/TruLens initially, but the `Evaluator` interface must make this straightforward.
- Advanced NLP/Statistical Metrics: Complex metrics (BLEU, ROUGE) can be added as custom `Evaluator` implementations later.
- Human Feedback Annotation UI: Framework should ingest structured human feedback, but the UI for collection is external.
4. Functional Requirements: What It Must Do
- FR1: Define Evaluation Criteria: 🟢 Implemented.
  - Provide a clear mechanism to define individual evaluation criteria, including `name` (string, unique identifier), `description` (string, explanation for humans), `scale` (`EvaluationScale` enum/union type), and an optional `weight` (number, for weighted aggregation).
  - Support loading or managing sets of these criteria for specific evaluation runs (currently programmatically).
- FR2: Implement Diverse Evaluators: 🟢 Core interface and initial evaluators implemented.
  - Define a standard `Evaluator` interface contract: `interface Evaluator { type: string; /* Unique identifier for the evaluator type */ evaluate(input: EvaluationInput, criteria: EvaluationCriteria[]): Promise<EvaluationResult[]>; }`.
  - FR2.1 (Rule-Based): 🟢 Implemented `RuleBasedEvaluator`. This evaluator is configurable with a set of rules, where each rule is linked to a specific `EvaluationCriteria` (by name) and performs a deterministic check (e.g., regex match, length check, keyword count, JSON parse). It provides fast, low-cost checks suitable for basic validation.
  - FR2.2 (LLM-as-Judge): 🟢 Implemented `LLMJudgeEvaluator`. This evaluator accepts a configured `CoreLLM` instance (compatible with Vercel AI SDK). It uses robust prompt templating and `generateObject` for structured output to assess the `EvaluationInput` against the provided `EvaluationCriteria`. It reliably parses the LLM's response to extract scores and reasoning.
  - FR2.3 (NLP-Accuracy - Semantic): 🟢 Implemented `NLPAccuracyEvaluator`. This evaluator uses embedding models (via Vercel AI SDK or compatible providers) to generate vector embeddings for the agent's response and `groundTruth`. It then calculates cosine similarity, providing a score for semantic alignment. Essential for understanding meaning beyond lexical match.
  - FR2.4 (Tool Usage): 🟢 Implemented `ToolUsageEvaluator`. This rule-based evaluator checks for correct tool invocation, argument validation, and required tool usage based on configured rules. It inspects `messageHistory` or `context` for tool call data.
  - FR2.5 (Lexical Suite - Practical & Fast): 🟢 Implemented a suite of practical, non-LLM lexical evaluators for rapid, cost-effective checks:
    - `LexicalSimilarityEvaluator`: Measures string similarity (Sorensen-Dice, Jaro-Winkler, Levenshtein) between a source field (e.g., response) and a reference field (e.g., groundTruth). Useful for assessing how close an answer is to an expected textual output.
    - `KeywordCoverageEvaluator`: Determines the percentage of predefined keywords found in a source text. Critical for ensuring key concepts or entities are addressed. Configurable for case sensitivity, keyword source (config, groundTruth, context), and whitespace normalization.
    - `SentimentEvaluator`: Analyzes the sentiment of a text (positive, negative, neutral) using an AFINN-based library. Provides options for normalized comparative scores, raw scores, or categorical output. Essential for gauging the tone of a response.
    - `ToxicityEvaluator`: Scans text for a predefined list of toxic terms. Returns a binary score (toxic/not-toxic). A fundamental check for safety and appropriateness. Configurable for case sensitivity and whole-word matching.
  - FR2.6 (Extensibility): 🟢 The framework makes it straightforward for developers to create and integrate their own custom `Evaluator` classes by simply implementing the `Evaluator` interface. This is the primary hook for custom logic.
- FR3: Execute Evaluations Systematically: 🟢 Implemented.
  - The `EvaluationRunner` component orchestrates the evaluation process.
  - It accepts an `EvaluationInput` object and an `EvaluationRunConfig` (which includes `evaluatorConfigs`; the `criteria` are defined in the input).
  - It iterates through the configured evaluators, invoking their `evaluate` method.
  - It handles errors at the individual evaluator level, logging errors and continuing where possible.
  - Execution leverages asynchronous operations (`Promise`-based).
  - It collects all successfully generated `EvaluationResult` objects.
- FR4: Aggregate Evaluation Results: 🟢 Implemented.
  - The `EvaluationRunner` aggregates the collected `EvaluationResult[]` into a single `AggregatedEvaluationResult` object.
  - It supports weighted average scoring, normalizing scores from different scales (e.g., Likert, boolean, numeric 0-1 or 0-100) to a 0-1 range for consistent aggregation where appropriate.
- FR5: Store Evaluation Results Persistently: 🟢 Implemented.
  - Defined a clear, serializable schema for `AggregatedEvaluationResult`, capturing essential information.
  - Defined the `EvaluationStorageProvider` interface: `interface EvaluationStorageProvider { saveResult(result: AggregatedEvaluationResult): Promise<void>; }`.
  - Provided `JsonFileStorageProvider`, which appends the serialized `AggregatedEvaluationResult` to a file.
- FR6: Integrate with Core AgentDock: 🟢 Implemented.
  - The primary invocation API `runEvaluation(input: EvaluationInput, config: EvaluationRunConfig): Promise<AggregatedEvaluationResult>` is established (a usage sketch follows this requirements list).
  - The `EvaluationRunConfig` expects `evaluatorConfigs` (an array of `RuleBasedEvaluatorConfig | LLMJudgeEvaluatorConfig`), which specify the type and specific configuration for each evaluator to be instantiated by the runner.
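To make the programmatic flow concrete, here is a minimal sketch of defining criteria and invoking `runEvaluation`. It assumes the public exports described in Section 6 (the `runEvaluation` function and core types via the evaluation module's `index.ts`); the import path and the shape of the rule-based `evaluatorConfigs` entry are illustrative assumptions, not the confirmed API.

```typescript
// Minimal sketch: programmatic criteria + runEvaluation invocation.
// The import path and the evaluatorConfigs entry shape are assumptions for illustration.
import {
  runEvaluation,
  type EvaluationCriteria,
  type EvaluationInput,
  type EvaluationRunConfig
} from 'agentdock-core/evaluation'; // assumed entry point (evaluation/index.ts)

const criteria: EvaluationCriteria[] = [
  { name: 'IsConcise', description: 'Response stays within a reasonable length.', scale: 'binary', weight: 1 },
  { name: 'Helpfulness', description: 'Response addresses the user request.', scale: 'likert5', weight: 2 }
];

const input: EvaluationInput = {
  prompt: 'Summarize the refund policy in two sentences.',
  response: 'Refunds are available within 30 days of purchase with a valid receipt...',
  criteria,
  agentId: 'support-agent',
  sessionId: 'session-123'
};

const config: EvaluationRunConfig = {
  // Hypothetical entries: the exact discriminator and rule shapes may differ in agentdock-core.
  evaluatorConfigs: [
    { type: 'RuleBased', rules: [{ criterionName: 'IsConcise', config: { type: 'length', max: 400 } }] }
    // An LLMJudgeEvaluatorConfig entry (judge model, prompt settings) would cover 'Helpfulness'.
  ]
  // A storage provider (e.g., JsonFileStorageProvider) can also be configured; field omitted here.
};

async function main() {
  const aggregated = await runEvaluation(input, config);
  console.log('Overall score:', aggregated.overallScore);
  console.log('Individual results:', aggregated.results);
}

main().catch(console.error);
```

The same call works identically whether it is embedded in a CI job, a benchmark script, or (later) a service wrapper, which is the point of keeping the invocation surface this small.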
5. Non-Functional Requirements: Ensuring Production Readiness
Beyond just features, the framework must be built for real-world use.
- NFR1: Modularity & Extensibility: This is paramount. The design must heavily rely on interfaces (`Evaluator`, `EvaluationStorageProvider`) to ensure loose coupling. Adding new evaluation methods or storage backends should require no changes to the core `EvaluationRunner`. The architecture must inherently support different deployment models (e.g., running evaluations as an in-process library call vs. wrapping the core logic in a separate microservice). This future-proofs the framework. (A custom evaluator and storage provider sketch follows this list.)
- NFR2: Configurability: Users must be able to easily configure evaluation runs: selecting which evaluators to use, defining the criteria set, adjusting settings for specific evaluators (e.g., the LLM model for the judge), and specifying the storage provider. Usability depends on good configuration options.
  - Configuration Strategy Options:
    - 🟢 Programmatic (Primary for Phase 1): Configuration objects are passed directly to the `runEvaluation` function via `EvaluationRunConfig`.
    - File-Based (Future Consideration): Design should not preclude loading evaluation configurations (criteria definitions, evaluator selections, specific settings) from dedicated configuration files (e.g., `evaluation.config.ts`, JSON files).
    - Agent Definition Integration (Future Consideration): Potentially allow defining default evaluation configurations as part of an agent's overall definition.
  - Initial implementation will focus on programmatic configuration for simplicity and direct control, but the underlying structures should support file-based loading later.
- NFR3: Performance & Cost Awareness: LLM-based evaluations can be slow and expensive.
  - The framework must support selective execution of evaluators (e.g., running only fast rule-based checks in some contexts).
  - IO-bound evaluators (`LLMJudgeEvaluator`, storage providers) must operate asynchronously (`Promise`-based) to avoid blocking the main thread.
  - Documentation should clearly outline the relative cost and latency implications of different evaluators (e.g., RuleBased vs. LLMJudge). Production decisions often hinge on these factors.
- NFR4: Testability: All core components (runner, evaluators, storage providers, type definitions) must be designed for unit testing. Dependencies should be injectable or easily mockable. Reliable software development demands comprehensive testing.
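As an illustration of the loose coupling NFR1 calls for, the sketch below implements the two extension interfaces directly: a trivial custom `Evaluator` and an in-memory `EvaluationStorageProvider`. The interface shapes are copied locally from Section 6 of this PRD so the snippet is self-contained; everything else (class names, check logic) is illustrative, not part of the framework.

```typescript
// Self-contained sketch: the interface shapes below mirror Section 6 of this PRD.
// In real code they would be imported from agentdock-core's evaluation module.
interface EvaluationCriteria { name: string; description: string; scale: string; weight?: number; }
interface EvaluationInput { response: string | Record<string, unknown>; criteria: EvaluationCriteria[]; }
interface EvaluationResult {
  criterionName: string;
  score: number | boolean | string;
  reasoning?: string;
  evaluatorType: string;
  error?: string;
}
interface AggregatedEvaluationResult { overallScore?: number; results: EvaluationResult[]; timestamp: number; }

interface Evaluator {
  type: string;
  evaluate(input: EvaluationInput, criteria: EvaluationCriteria[]): Promise<EvaluationResult[]>;
}
interface EvaluationStorageProvider {
  saveResult(result: AggregatedEvaluationResult): Promise<void>;
}

// Custom evaluator: flags responses containing apology language (illustrative logic only).
class ApologyCheckEvaluator implements Evaluator {
  readonly type = 'ApologyCheck';

  async evaluate(input: EvaluationInput, criteria: EvaluationCriteria[]): Promise<EvaluationResult[]> {
    const text = typeof input.response === 'string' ? input.response : JSON.stringify(input.response);
    const containsApology = /\b(sorry|apologi[sz]e)\b/i.test(text);
    return criteria
      .filter((c) => c.name === 'NoUnnecessaryApology')
      .map((c) => ({
        criterionName: c.name,
        score: !containsApology, // boolean score, suited to a 'binary' scale
        reasoning: containsApology ? 'Response contains apology language.' : 'No apology language found.',
        evaluatorType: this.type
      }));
  }
}

// Custom storage provider: keeps aggregated results in memory rather than a JSON file.
class InMemoryStorageProvider implements EvaluationStorageProvider {
  readonly saved: AggregatedEvaluationResult[] = [];
  async saveResult(result: AggregatedEvaluationResult): Promise<void> {
    this.saved.push(result);
  }
}
```

Because the runner depends only on these two interfaces, either class can be swapped in without touching `EvaluationRunner` itself.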
6. High-Level Architecture & Key Data Structures
The implementation will reside primarily within a new top-level directory in the core library.
- Primary Directory: `agentdock-core/src/evaluation/`
- Core Types (`evaluation/types.ts`):
  - `EvaluationScale` (normalization behavior is illustrated in the worked sketch after this outline):

    ```typescript
    type EvaluationScale = 'binary' | 'likert5' | 'numeric' | 'pass/fail' | string;
    // binary:    Simple yes/no, true/false. Normalized to 0 or 1.
    // likert5:   Standard 1-5 rating scale. Normalized to 0-1 as (score - 1) / 4.
    // numeric:   Any plain number score. 0-1 used as is; 0-100 normalized to 0-1 by dividing by 100;
    //            other ranges are currently not normalized for aggregation unless they are 0 or 1.
    // pass/fail: Clear categorical outcome. Normalized to 0 or 1.
    // string:    Custom scales or categorical results. Normalized to 0 or 1 for 'true'/'false',
    //            'pass'/'fail', etc.; otherwise not included in numeric aggregation unless
    //            parsable to a number fitting a numeric/likert scale.
    ```

  - `EvaluationCriteria`:

    ```typescript
    interface EvaluationCriteria {
      name: string;           // Unique identifier for the criterion
      description: string;    // Human-readable explanation
      scale: EvaluationScale; // The scale used for scoring this criterion
      weight?: number;        // Optional weight for aggregation
    }
    ```

  - `EvaluationInput`:

    ```typescript
    interface EvaluationInput {
      // Rich context for the evaluation
      prompt?: string;                   // Optional initiating prompt
      response: string | AgentMessage;   // The agent output being evaluated
      context?: Record<string, any>;     // Arbitrary contextual data
      groundTruth?: string | any;        // Optional reference answer/data
      criteria: EvaluationCriteria[];    // Criteria being evaluated against
      agentConfig?: Record<string, any>; // Snapshot of agent config at time of response
      messageHistory?: AgentMessage[];   // Relevant message history
      timestamp?: number;                // Timestamp of the response generation
      sessionId?: string;                // Identifier for the session/conversation
      agentId?: string;                  // Identifier for the agent instance
      metadata?: Record<string, any>;    // Other arbitrary metadata (e.g., test runner context if applicable)
    }
    ```

  - `EvaluationResult`:

    ```typescript
    interface EvaluationResult {
      // Result for a single criterion from one evaluator
      criterionName: string;             // Links back to EvaluationCriteria.name
      score: number | boolean | string;  // The actual score/judgment
      reasoning?: string;                // Optional explanation from the evaluator (esp. LLM judge)
      evaluatorType: string;             // Identifier for the evaluator producing this result
      error?: string;                    // Error message if this specific evaluation failed
      metadata?: Record<string, any>;    // Evaluator-specific metadata
    }
    ```

  - `AggregatedEvaluationResult`:

    ```typescript
    interface AggregatedEvaluationResult {
      // Overall result for an evaluation run
      overallScore?: number;            // Optional aggregated score (e.g., weighted avg)
      results: EvaluationResult[];      // List of individual results from all evaluators
      timestamp: number;                // Timestamp of the evaluation run completion
      agentId?: string;                 // Copied from input
      sessionId?: string;               // Copied from input
      inputSnapshot: EvaluationInput;   // Capture the exact input used
      evaluationConfigSnapshot?: {      // Snapshot of criteria, evaluators used
        evaluatorTypes: string[];
        criteriaNames: string[];
        storageProviderType: string;
        metadataKeys: string[];
      };
      metadata?: Record<string, any>;   // Run-level metadata
    }
    ```

  - `Evaluator`:

    ```typescript
    interface Evaluator {
      type: string;
      evaluate(input: EvaluationInput, criteria: EvaluationCriteria[]): Promise<EvaluationResult[]>;
    }
    ```

  - `EvaluationStorageProvider`:

    ```typescript
    interface EvaluationStorageProvider {
      saveResult(result: AggregatedEvaluationResult): Promise<void>;
    }
    ```
- Sub-directories & Components:
  - `evaluation/criteria/`: Utilities or helpers related to defining/managing criteria sets (if needed beyond simple objects).
  - `evaluation/evaluators/`: Implementations of the `Evaluator` interface, organized into subdirectories by type (e.g., `rule-based/`, `llm/`, `nlp/`, `tool/`, `lexical/`).
  - `evaluation/runner/`: Implementation of the `EvaluationRunner` logic (`index.ts`).
  - `evaluation/storage/`: The `EvaluationStorageProvider` interface definition and concrete implementations (`json_file_storage.ts`, potentially others later).
  - `evaluation/types.ts`: Location for all core type definitions and interfaces listed above.
  - `evaluation/index.ts`: Main entry point exporting the public API of the evaluation module (e.g., the `runEvaluation` function, core types, interfaces).
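To make the normalization notes on `EvaluationScale` and the weighted aggregation in FR4 concrete, here is a small worked sketch of the kind of logic the runner applies. It is not the framework's actual `normalizeEvaluationScore` implementation, only an illustration of the behavior described above.

```typescript
// Illustrative only: mirrors the normalization rules described for EvaluationScale
// and the weighted aggregation described in FR4. Not the actual runner code.
type Scale = 'binary' | 'likert5' | 'numeric' | 'pass/fail' | string;

function normalize(score: number | boolean | string, scale: Scale): number | undefined {
  if (typeof score === 'boolean') return score ? 1 : 0;
  if (typeof score === 'string') {
    const s = score.toLowerCase();
    if (s === 'true' || s === 'pass') return 1;
    if (s === 'false' || s === 'fail') return 0;
    const parsed = Number(s);
    return Number.isNaN(parsed) ? undefined : normalize(parsed, scale);
  }
  if (scale === 'likert5') return (score - 1) / 4;                           // 1-5 -> 0-1
  if (scale === 'numeric' && score >= 0 && score <= 1) return score;         // already 0-1
  if (scale === 'numeric' && score > 1 && score <= 100) return score / 100;  // treat as 0-100
  if (score === 0 || score === 1) return score;                              // binary / pass-fail as numbers
  return undefined;                                                          // excluded from aggregation
}

// Weighted average: a likert5 score of 4 (weight 2) and a passing binary check (weight 1).
const items = [
  { value: normalize(4, 'likert5'), weight: 2 },  // 0.75
  { value: normalize(true, 'binary'), weight: 1 } // 1
].filter((i): i is { value: number; weight: number } => i.value !== undefined);

const overall =
  items.reduce((sum, i) => sum + i.value * i.weight, 0) /
  items.reduce((sum, i) => sum + i.weight, 0); // (0.75 * 2 + 1 * 1) / 3 ≈ 0.833
```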
7. Where We Start: Phased Implementation & Next Steps
On Test Implementation Timing. There's a common reflex to demand unit tests for every line of code the moment it's written. We called testability mandatory (NFR4), and fundamentally that's not wrong. However, in the context of iterative development, especially when new capabilities are being forged, front-loading comprehensive unit tests for features still in flux often leads to wasted effort. My approach, grounded in experience shipping actual product, is more pragmatic:
- Build the core feature. Get it to a point where it functions and its core value can be assessed.
- Validate it in a realistic scenario. This could be through example scripts, integration into a local build, or whatever proves it does the intended job effectively. This is about confirming that what we've built is right.
- Refine based on this practical validation.
- Once the feature is stable and its design proven, then implement the comprehensive unit tests. These tests then serve their true purpose: to lock in the proven behavior and guard against regressions.
Writing tests for rapidly evolving or unproven code is an exercise in churn. We'll build, we'll validate functionally, and then we'll write the tests that matter for the long term. This ensures our testing effort is targeted and efficient, not just a checkbox exercise.
Note on Evaluator Test Scenarios: While initial functional validation (e.g., via `run_evaluation_example.ts`) ensures core evaluator capabilities, the development of comprehensive test suites covering diverse edge cases (e.g., for `ToolUsageEvaluator`: missing required tools, invalid arguments, multiple calls, different data sources) will be a dedicated effort during the unit test writing phase for each evaluator. This ensures robust coverage once the evaluator's primary functionality is stabilized.
We've built the foundation using a "tracer bullet" approach, establishing an end-to-end flow that validates the core architecture.
Status Legend:
- 🟢: Done
- 🟡: Needs Implementation/Refinement/Tests
- 🔴: Not Started
Phase 1: Foundational Implementation (Largely Complete)
- 🟢 Establish Module & Structure: Created `agentdock-core/src/evaluation/` and sub-directories.
- 🟢 Define Core Interfaces & Types: Implemented in `evaluation/types.ts`.
- 🟢 Basic Criteria Handling: `EvaluationCriteria[]` defined and passed programmatically.
- 🟢 Evaluation Runner Implemented: Core logic, evaluator instantiation from `evaluatorConfigs`, error handling, and score normalization with weighted aggregation are in place.
- 🟢 Basic Storage Implementation: `JsonFileStorageProvider` implemented and functional.
- 🟢 RuleBasedEvaluator Implemented: Supports regex, length, includes, and json_parse rules (a conceptual rule sketch follows this list).
- 🟢 LLMJudgeEvaluator Implemented: Uses Vercel AI SDK's `generateObject` for structured output and `CoreLLM`.
- 🟢 Example Script (`run_evaluation_example.ts`): Successfully demonstrates programmatic configuration and execution of the framework with both rule-based and LLM judges, outputting to the console and a JSONL file. Relocated to the `examples/` directory.
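For orientation, here is a conceptual sketch of the four rule kinds the `RuleBasedEvaluator` supports (regex, length, includes, json_parse). The rule object shapes and the `type` discriminator are illustrative assumptions based on the rule types listed above; consult the evaluator documentation under `docs/evaluations/evaluators/` for the exact configuration schema.

```typescript
// Conceptual sketch only: rule shapes are illustrative, not the confirmed config schema.
// Each rule targets an EvaluationCriteria by name and applies one deterministic check.
const ruleBasedEvaluatorConfig = {
  type: 'RuleBased', // assumed discriminator used in evaluatorConfigs
  rules: [
    { criterionName: 'MentionsRefundPolicy', config: { type: 'regex', pattern: 'refund', flags: 'i' } },
    { criterionName: 'IsConcise',            config: { type: 'length', max: 500 } },
    { criterionName: 'IncludesGreeting',     config: { type: 'includes', value: 'Hello' } },
    { criterionName: 'IsValidJson',          config: { type: 'json_parse' } }
  ]
};
```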
Phase 1.5: Core Enhancements & New Evaluator Types (Largely Complete)
- 🟢 `NLPAccuracyEvaluator` Implementation (Semantic Similarity):
  - Goal: Evaluate how semantically similar an agent's response is to a ground truth reference.
  - Approach: Created `agentdock-core/src/evaluation/evaluators/nlp/accuracy.ts`. This evaluator uses embedding models (e.g., via Vercel AI SDK or other compatible sentence transformer providers) to generate vector embeddings for both the agent's response and the `groundTruth` from `EvaluationInput`. It then calculates the cosine similarity between these embeddings. The resulting score (0-1 range) represents the semantic accuracy.
  - Configuration: `NLPAccuracyEvaluatorConfig` allows specifying the embedding model and criterion name.
  - Output: `EvaluationResult` with the cosine similarity as the score.
  - Status: Implemented and functionally tested via example script. Unit tests pending.
- 🟢 `ToolUsageEvaluator` Implementation:
  - Goal: Assess if an agent correctly used its designated tools.
  - Approach: Created `agentdock-core/src/evaluation/evaluators/tool/usage.ts`. This rule-based evaluator checks for expected tool calls, validates argument structure/content via custom functions, and enforces required tool usage. It sources tool call data from `messageHistory` (structured `tool_call` and `tool_result` content parts) or `context`.
  - Configuration: `ToolUsageEvaluatorConfig` takes an array of `ToolUsageRule`s (specifying `criterionName`, `expectedToolName`, an `argumentChecks` function, `isRequired`) and a `toolDataSource` option (a configuration sketch follows this list).
  - Output: `EvaluationResult` (typically binary pass/fail per rule) for criteria like "ToolInvocationCorrectness" or "ToolParameterAccuracy".
  - Status: Implemented and functionally tested via example script. Unit tests pending.
Phase 1.6: Practical Lexical Evaluator Suite (New & Complete)
This phase focuses on delivering a suite of fast, cost-effective, and practical lexical evaluators, providing essential checks without reliance on LLMs, aligning with a pragmatic evaluation philosophy.
- 🟢 `LexicalSimilarityEvaluator` Implementation:
  - Goal: Measure direct textual similarity between an agent's response and a reference.
  - Approach: Implemented in `agentdock-core/src/evaluation/evaluators/lexical/similarity.ts`. Uses algorithms like Sorensen-Dice (default), Jaro-Winkler, or Levenshtein.
  - Configuration: `LexicalSimilarityEvaluatorConfig` includes `criterionName`, `sourceField` (e.g., 'response'), `referenceField` (e.g., 'groundTruth'), `algorithm`, `caseSensitive`, `normalizeWhitespace`.
  - Output: `EvaluationResult` with a normalized similarity score (0-1).
  - Status: Implemented and functionally tested. Unit tests pending.
- 🟢 `KeywordCoverageEvaluator` Implementation:
  - Goal: Ensure key terms or concepts are present in the agent's response.
  - Approach: Implemented in `agentdock-core/src/evaluation/evaluators/lexical/keyword_coverage.ts`. Calculates the percentage of `expectedKeywords` found in the `sourceTextField`.
  - Configuration: `KeywordCoverageEvaluatorConfig` includes `criterionName`, `expectedKeywords` (or `keywordsSourceField` to pull from `groundTruth` or `context`), `sourceTextField`, `caseSensitive`, `matchWholeWord`, `normalizeWhitespace`.
  - Output: `EvaluationResult` with a coverage score (0-1).
  - Status: Implemented and functionally tested. Unit tests pending.
- 🟢 `SentimentEvaluator` Implementation:
  - Goal: Assess the emotional tone of the agent's response.
  - Approach: Implemented in `agentdock-core/src/evaluation/evaluators/lexical/sentiment.ts`. Uses the `sentiment` npm package (AFINN-based wordlist).
  - Configuration: `SentimentEvaluatorConfig` includes `criterionName`, `sourceTextField`, `outputType` ('comparativeNormalized', 'rawScore', 'category'), and thresholds for categorization.
  - Output: `EvaluationResult` with a sentiment score or category.
  - Status: Implemented and functionally tested. (Note: the `sentiment` package is old and flagged for future review/replacement if needed.) Unit tests pending.
- 🟢 `ToxicityEvaluator` Implementation:
  - Goal: Detect presence of undesirable or toxic language.
  - Approach: Implemented in `agentdock-core/src/evaluation/evaluators/lexical/toxicity.ts`. Checks text against a list of `toxicTerms`.
  - Configuration: `ToxicityEvaluatorConfig` includes `criterionName`, `toxicTerms`, `sourceTextField`, `caseSensitive`, `matchWholeWord` (a combined configuration sketch follows this list).
  - Output: `EvaluationResult` with a binary score (true if not toxic, false if toxic).
  - Status: Implemented and functionally tested. Unit tests pending.
Phase 1.6.1: Initial Core Documentation (New & Complete)
- 🟢 Detailed Evaluator Documentation:
  - Goal: Provide clear, comprehensive documentation for each implemented evaluator and for creating custom evaluators.
  - Approach: Created individual Markdown files for each evaluator within the `docs/evaluations/evaluators/` directory. Each document includes an overview, core workflow (with a Mermaid diagram), use cases, configuration guidance (with a conceptual code example), and expected output details. A guide for creating custom evaluators (`custom-evaluators.md`) has also been created.
  - Covered Evaluators: `RuleBasedEvaluator`, `LLMJudgeEvaluator`, `NLPAccuracyEvaluator`, `ToolUsageEvaluator`, and the Lexical Suite (`LexicalSimilarityEvaluator`, `KeywordCoverageEvaluator`, `SentimentEvaluator`, `ToxicityEvaluator`). An overview page for the Lexical Suite (`lexical-evaluators.md`) was also added.
  - Status: Initial drafts complete and available in `docs/evaluations/`. The main `docs/evaluations/README.md` provides an overview and links.
Phase 1.7: Comprehensive Testing (Next Up)
- 🟢 Unit & Integration Tests: Systematically add tests for all evaluators (RuleBased, LLMJudge, NLPAccuracy, ToolUsage, and all Lexical evaluators), detailed runner logic (including edge cases for normalization and aggregation), and storage provider interactions. Ensure robust mocking of external dependencies like LLMs. (Note: one context variable assertion in the LLMJudge tests is temporarily skipped and requires investigation; the Jaro-Winkler test in LexicalSimilarity is also skipped due to a suspected library issue.) An illustrative test sketch follows.
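To indicate the shape this testing effort could take, here is a minimal Jest-style unit test sketch for the `ToxicityEvaluator`. The import path, constructor signature, and config fields are assumptions drawn from the descriptions earlier in this document; the real suites will target the actual implementations in `agentdock-core/src/evaluation/`.

```typescript
// Illustrative Jest test sketch; the import path and constructor/config shapes are assumed.
import { ToxicityEvaluator } from '../evaluators/lexical/toxicity'; // assumed relative path

describe('ToxicityEvaluator', () => {
  const criteria = [
    { name: 'IsNotToxic', description: 'Response avoids toxic terms.', scale: 'binary' }
  ];
  const evaluator = new ToxicityEvaluator({
    criterionName: 'IsNotToxic',
    toxicTerms: ['stupid'],
    sourceTextField: 'response',
    caseSensitive: false,
    matchWholeWord: true
  });

  it('passes clean responses', async () => {
    const results = await evaluator.evaluate(
      { response: 'Happy to help with that.', criteria },
      criteria
    );
    expect(results[0].score).toBe(true); // true = not toxic, per the ToxicityEvaluator output contract
  });

  it('flags responses containing configured toxic terms', async () => {
    const results = await evaluator.evaluate(
      { response: 'That was a stupid question.', criteria },
      criteria
    );
    expect(results[0].score).toBe(false);
  });
});
```

Lexical evaluators like this one need no mocking; the LLM-backed evaluators will additionally require a mocked `CoreLLM`, as noted in the Phase 2 mock-robustness item below.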
Phase 2: Advanced Features & Ecosystem Integration (Future Work)
- 🔴 Advanced Storage Solutions: Implement `EvaluationStorageProvider` for robust backends (e.g., PostgreSQL, specialized MLOps databases).
- 🔴 Sophisticated Aggregation & Reporting: Allow for more complex aggregation strategies, configurable reporting formats, or basic statistical analysis on results.
- 🔴 Agent-Level Evaluation Paradigms: Develop patterns or specialized evaluators for assessing multi-turn conversation quality, complex task completion across multiple steps, or agent adherence to long-term goals.
- 🔴 Configuration from Files: Allow loading `EvaluationRunConfig` (or parts of it, like criteria sets or evaluator profiles) from static files (e.g., JSON, TS) for easier management of standard evaluation suites.
- 🔴 Observability Enhancements: Deeper integration with tracing/logging systems, potentially emitting OpenTelemetry-compatible evaluation events.
- 🔴 UI for Evaluation Results: Develop a basic UI component or page within the AgentDock dashboard (or as a standalone tool) to view, filter, and compare evaluation results persisted via the storage abstraction layer (SAL).
- 🔴 Benchmark Suite: Establish a standardized benchmark suite using the evaluation framework to track performance of core agent capabilities over time and across different model versions.
- 🔴 Community Evaluators: Document and streamline the process for community contributions of new evaluators to `agentdock-core`.
- 🔴 Improve Test Mock Robustness: Transition simple test stubs (e.g., for `CoreLLM` in runner tests) to full `jest.mock()` implementations for better coverage and type safety as tests evolve.
- 🔴 Refactor Score Normalization Logic: Extract the `normalizeEvaluationScore` function from the runner and its duplicate in tests into a shared utility to prevent drift and improve test reliability.
- 🔴 Improve Evaluation Module Exports: Refactor `agentdock-core/evaluation` to expose necessary components like `JsonFileStorageProvider` via its public API (`index.ts`) to avoid deep internal imports in consuming code.
Previously listed items (status may need review as Phase 2 progresses):
- 🟡 Resolve Skipped Tests: Investigate and fix the skipped context variable assertion test in `LLMJudgeEvaluator` and the Jaro-Winkler test in `LexicalSimilarityEvaluator`.
- 🟡 Implement Runner Validation: Complete the `validateEvaluatorConfigs` logic in the `EvaluationRunner` to ensure evaluator configurations are valid before execution.
- 🟡 Implement Advanced Evaluator Features: Address Phase 2 TODOs within existing evaluators, such as advanced LLM Judge configurations (e.g., reference-free evaluation), sequence checking in `ToolUsageEvaluator`, and potentially adding more algorithms or features to the Lexical Suite.