# Evaluation Framework
Intent Kit provides an evaluation framework for testing and benchmarking nodes against structured datasets, enabling systematic measurement of node performance, accuracy, and reliability.
## Overview

The evaluation framework includes:

- **Structured Datasets** - YAML-based test case definitions
- **Comprehensive Metrics** - Accuracy, performance, and cost tracking
- **Automated Testing** - Batch evaluation of multiple test cases
- **Report Generation** - Detailed performance reports and analysis
- **Regression Testing** - Track performance changes over time
## Basic Usage

### Running Evaluations

```python
from intent_kit.evals import evaluate_node, load_dataset

# Load a dataset
dataset = load_dataset("path/to/dataset.yaml")

# Evaluate a node
result = evaluate_node(dataset, node_instance)

# Print results
result.print_summary()
print(f"Accuracy: {result.accuracy():.1%}")
```
### Command Line Interface

```bash
# Run all evaluations
uv run python -m intent_kit.evals.run_all_evals

# Run a specific node evaluation
uv run python -m intent_kit.evals.run_node_eval --dataset classifier_node_llm.yaml

# Run with a specific configuration
uv run python -m intent_kit.evals.run_node_eval --dataset extractor_node.yaml --config production.yaml
```
## Dataset Format

### YAML Dataset Structure

```yaml
name: "Classifier Node Evaluation"
description: "Test cases for intent classification"
node_type: "classifier"
node_name: "intent_classifier"

test_cases:
  - input: "Hello, how are you?"
    expected: "greet"
    context:
      user_id: "test_user_1"

  - input: "What's the weather like?"
    expected: "weather"
    context:
      user_id: "test_user_2"

  - input: "Book a table for 4 people"
    expected: "booking"
    context:
      user_id: "test_user_3"
```
### Test Case Structure

```yaml
test_cases:
  - input: "User input string"
    expected: "Expected output or result"
    context:
      # Optional context data
      user_id: "user123"
      session_id: "session456"
    metadata:
      category: "greeting"
      difficulty: "easy"
```
## Creating Datasets

### Classifier Node Dataset

```yaml
name: "Intent Classification Dataset"
description: "Test cases for intent classification"
node_type: "classifier"
node_name: "intent_classifier"

test_cases:
  # Greeting intents
  - input: "Hello there"
    expected: "greet"
  - input: "Good morning"
    expected: "greet"
  - input: "Hi, how are you doing?"
    expected: "greet"

  # Weather intents
  - input: "What's the weather like?"
    expected: "weather"
  - input: "Is it going to rain today?"
    expected: "weather"
  - input: "Weather forecast for tomorrow"
    expected: "weather"

  # Booking intents
  - input: "I want to book a table"
    expected: "booking"
  - input: "Reserve a flight to New York"
    expected: "booking"
  - input: "Book a hotel room"
    expected: "booking"
```
### Extractor Node Dataset

```yaml
name: "Parameter Extraction Dataset"
description: "Test cases for parameter extraction"
node_type: "extractor"
node_name: "booking_extractor"

test_cases:
  - input: "Book a table for 4 people at 7 PM"
    expected:
      party_size: 4
      time: "7 PM"
    context:
      required_params: ["party_size", "time"]

  - input: "Reserve a flight from LA to NYC on March 15"
    expected:
      origin: "LA"
      destination: "NYC"
      date: "March 15"
    context:
      required_params: ["origin", "destination", "date"]

  - input: "I need a hotel room for 2 nights starting tomorrow"
    expected:
      duration: 2
      start_date: "tomorrow"
    context:
      required_params: ["duration", "start_date"]
```
### Action Node Dataset

```yaml
name: "Action Execution Dataset"
description: "Test cases for action execution"
node_type: "action"
node_name: "greet_action"

test_cases:
  - input: "Hello"
    expected: "Hello! How can I help you today?"
    context:
      user_name: "John"

  - input: "Hi there"
    expected: "Hi there! How can I help you today?"
    context:
      user_name: "Jane"
```
## Advanced Dataset Features

### Context-Aware Testing

```yaml
test_cases:
  - input: "What's my balance?"
    expected: "Your current balance is $1,250.00"
    context:
      user_id: "user123"
      account_balance: 1250.00

  - input: "Transfer $100 to savings"
    expected: "Transferred $100.00 to your savings account"
    context:
      user_id: "user123"
      checking_balance: 500.00
      savings_balance: 2000.00
```
### Edge Case Testing

```yaml
test_cases:
  # Empty input
  - input: ""
    expected: "clarification"

  # Very long input
  - input: "This is a very long input that tests how the system handles extremely long user inputs that might exceed normal length limits and could potentially cause issues with processing or response generation"
    expected: "booking"

  # Special characters
  - input: "Book a table for 2 @ 7:30 PM"
    expected: "booking"

  # Unicode characters
  - input: "Réserver une table pour 4 personnes"
    expected: "booking"
```
### Performance Testing

```yaml
test_cases:
  # Simple queries
  - input: "Hello"
    expected: "greet"
    context:
      max_response_time: 1.0

  # Complex queries
  - input: "I need to book a flight from San Francisco to New York for next Tuesday, preferably in the morning, for 2 people, economy class, and I'd like to know about baggage allowance"
    expected: "booking"
    context:
      max_response_time: 5.0
```
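If your evaluation run records per-test timing, you can enforce these budgets yourself after the fact. The following is a minimal sketch, assuming each test result exposes the `input`, `elapsed_time`, and `context` fields described in the metrics section below; it is not a built-in Intent Kit check.

```python
# Sketch: flag tests that exceeded the max_response_time budget in their context.
# Assumes `result` comes from evaluate_node() and test results expose
# `input`, `elapsed_time`, and `context` as shown elsewhere in this guide.
def check_time_budgets(result):
    for test_result in result.results:
        budget = (test_result.context or {}).get("max_response_time")
        if budget is None or test_result.elapsed_time is None:
            continue
        if test_result.elapsed_time > budget:
            print(
                f"Over budget: '{test_result.input}' took "
                f"{test_result.elapsed_time:.2f}s (limit {budget:.1f}s)"
            )
```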
## Running Evaluations

### Programmatic Evaluation

```python
from intent_kit.evals import evaluate_node, load_dataset
from intent_kit import DAGBuilder

# Load dataset
dataset = load_dataset("datasets/classifier_node_llm.yaml")

# Create DAG with classifier
builder = DAGBuilder()
builder.with_default_llm_config({
    "provider": "openrouter",
    "model": "google/gemma-2-9b-it"
})
builder.add_node("classifier", "classifier",
                 output_labels=["greet", "weather", "booking"],
                 description="Test classifier")
builder.set_entrypoints(["classifier"])
dag = builder.build()

# Run evaluation
result = evaluate_node(dataset, dag)

# Analyze results
print(f"Total tests: {result.total_count()}")
print(f"Passed: {result.passed_count()}")
print(f"Failed: {result.failed_count()}")
print(f"Accuracy: {result.accuracy():.1%}")

# Get failed tests
failed_tests = result.errors()
for test in failed_tests:
    print(f"Failed: '{test.input}' -> Expected: {test.expected}, Got: {test.actual}")
```
### Batch Evaluation

```python
from pathlib import Path

from intent_kit.evals import run_all_evals

# Run all evaluations in a directory
datasets_dir = Path("intent_kit/evals/datasets")
results = run_all_evals(datasets_dir)

# Print summary
for dataset_name, result in results.items():
    print(f"{dataset_name}: {result.accuracy():.1%} accuracy")
```
### Configuration-Based Evaluation

```yaml
# evaluation_config.yaml
evaluation:
  datasets:
    - path: "datasets/classifier_node_llm.yaml"
      enabled: true
    - path: "datasets/extractor_node.yaml"
      enabled: true
    - path: "datasets/action_node.yaml"
      enabled: false

  settings:
    max_workers: 4
    timeout: 30
    retry_attempts: 3

  reporting:
    output_dir: "results"
    format: "json"
    include_metrics: true
```
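One way to consume a config like this is with a small driver script. The sketch below is an illustration only: it assumes PyYAML is available and a `node` to evaluate is already in scope, and it uses the `load_dataset`/`evaluate_node` API shown above. The `run_node_eval --config` CLI shown earlier may interpret such a file differently.

```python
# Sketch of a driver that runs only the enabled datasets from evaluation_config.yaml.
# Illustrative only: assumes PyYAML is installed and `node` is the node under test.
import yaml

from intent_kit.evals import evaluate_node, load_dataset

with open("evaluation_config.yaml") as f:
    config = yaml.safe_load(f)

for entry in config["evaluation"]["datasets"]:
    if not entry.get("enabled", False):
        continue  # skip datasets that are disabled in the config
    dataset = load_dataset(entry["path"])
    result = evaluate_node(dataset, node)
    print(f"{entry['path']}: {result.accuracy():.1%}")
```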
## Metrics and Analysis

### Performance Metrics

```python
# Get detailed metrics
result = evaluate_node(dataset, node)

# Timing metrics
for test_result in result.results:
    if test_result.elapsed_time:
        print(f"Response time: {test_result.elapsed_time:.2f}s")
    if test_result.metrics:
        print(f"Tokens used: {test_result.metrics.get('tokens', 'N/A')}")
        print(f"Cost: ${test_result.metrics.get('cost', 0):.4f}")
```
### Accuracy Analysis

```python
# Analyze accuracy by category
def analyze_accuracy_by_category(result):
    categories = {}

    for test_result in result.results:
        category = test_result.context.get('category', 'unknown')
        if category not in categories:
            categories[category] = {'passed': 0, 'total': 0}

        categories[category]['total'] += 1
        if test_result.passed:
            categories[category]['passed'] += 1

    for category, stats in categories.items():
        accuracy = stats['passed'] / stats['total']
        print(f"{category}: {accuracy:.1%} ({stats['passed']}/{stats['total']})")
```
### Error Analysis

```python
# Analyze common failure patterns
def analyze_errors(result):
    error_patterns = {}

    for test_result in result.errors():
        error_type = type(test_result.actual).__name__
        if error_type not in error_patterns:
            error_patterns[error_type] = []
        error_patterns[error_type].append(test_result.input)

    for error_type, inputs in error_patterns.items():
        print(f"{error_type}: {len(inputs)} occurrences")
        for input_text in inputs[:3]:  # Show first 3 examples
            print(f"  - {input_text}")
```
## Report Generation

### JSON Reports

```python
import json
from datetime import datetime

def generate_report(result, dataset_name):
    report = {
        "dataset": dataset_name,
        "timestamp": datetime.now().isoformat(),
        "summary": {
            "total_tests": result.total_count(),
            "passed": result.passed_count(),
            "failed": result.failed_count(),
            "accuracy": result.accuracy()
        },
        "details": []
    }

    for test_result in result.results:
        report["details"].append({
            "input": test_result.input,
            "expected": test_result.expected,
            "actual": test_result.actual,
            "passed": test_result.passed,
            "elapsed_time": test_result.elapsed_time,
            "metrics": test_result.metrics
        })

    return report

# Save report
report = generate_report(result, "classifier_evaluation")
with open("results/classifier_report.json", "w") as f:
    json.dump(report, f, indent=2)
```
### HTML Reports

```python
def generate_html_report(results):
    html = """
    <html>
    <head>
        <title>Evaluation Results</title>
        <style>
            body { font-family: Arial, sans-serif; margin: 20px; }
            .summary { background: #f5f5f5; padding: 15px; border-radius: 5px; }
            .test-case { margin: 10px 0; padding: 10px; border: 1px solid #ddd; }
            .passed { background: #d4edda; }
            .failed { background: #f8d7da; }
        </style>
    </head>
    <body>
        <h1>Evaluation Results</h1>
    """

    for dataset_name, result in results.items():
        html += f"""
        <div class="summary">
            <h2>{dataset_name}</h2>
            <p>Accuracy: {result.accuracy():.1%} ({result.passed_count()}/{result.total_count()})</p>
        </div>
        """

        for test_result in result.results:
            status_class = "passed" if test_result.passed else "failed"
            html += f"""
            <div class="test-case {status_class}">
                <strong>Input:</strong> {test_result.input}<br>
                <strong>Expected:</strong> {test_result.expected}<br>
                <strong>Actual:</strong> {test_result.actual}<br>
                <strong>Time:</strong> {test_result.elapsed_time:.2f}s
            </div>
            """

    html += "</body></html>"
    return html
```
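To get a file you can open in a browser, pass the same `results` mapping used in the batch evaluation example and write the returned string to disk. The output path below is illustrative.

```python
# Write the HTML report to disk (assumes `results` comes from run_all_evals)
html = generate_html_report(results)
with open("results/evaluation_report.html", "w") as f:
    f.write(html)
```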
## Continuous Integration

### GitHub Actions Workflow

```yaml
# .github/workflows/evaluation.yml
name: Run Evaluations

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install uv
          uv sync

      - name: Run evaluations
        run: |
          uv run python -m intent_kit.evals.run_all_evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-results
          path: results/
```
### Regression Testing

```python
def check_regression(current_result, baseline_result, threshold=0.05):
    """Check if performance has regressed significantly."""
    current_accuracy = current_result.accuracy()
    baseline_accuracy = baseline_result.accuracy()

    if current_accuracy < baseline_accuracy - threshold:
        print("⚠️ Performance regression detected!")
        print(f"Baseline: {baseline_accuracy:.1%}")
        print(f"Current: {current_accuracy:.1%}")
        print(f"Regression: {baseline_accuracy - current_accuracy:.1%}")
        return False

    print("✅ No significant regression detected")
    return True
```
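In CI, the baseline often comes from a previously saved report rather than a live result object. The variant below is a hedged sketch: the baseline file name and report layout follow the `generate_report()` example above and are assumptions, not a fixed Intent Kit convention, and `dag` is assumed to be built as in the Programmatic Evaluation section.

```python
# Illustrative variant: compare the current run against a stored JSON report.
# The baseline path and report layout mirror the generate_report() example above.
import json

from intent_kit.evals import evaluate_node, load_dataset

dataset = load_dataset("datasets/classifier_node_llm.yaml")
current_result = evaluate_node(dataset, dag)  # `dag` built as in Programmatic Evaluation

with open("results/baseline_classifier_report.json") as f:
    baseline_accuracy = json.load(f)["summary"]["accuracy"]

if current_result.accuracy() < baseline_accuracy - 0.05:
    raise SystemExit("Accuracy regression detected against stored baseline")
```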
## Best Practices

### 1. Comprehensive Test Coverage

```yaml
# Include various input types
test_cases:
  # Normal cases
  - input: "Hello"
    expected: "greet"

  # Edge cases
  - input: ""
    expected: "clarification"

  # Boundary cases
  - input: "AAAAAAAAAAAAAAAAAAAAAAAA"  # stand-in for a very long (e.g. ~1000-character) input
    expected: "greet"

  # Special characters
  - input: "Hello! How are you? 😊"
    expected: "greet"
```
### 2. Realistic Test Data

```yaml
# Use realistic, diverse test data
test_cases:
  - input: "Hi there, how's it going?"
    expected: "greet"
    context:
      user_demographic: "casual"

  - input: "Good morning, sir. How may I assist you today?"
    expected: "greet"
    context:
      user_demographic: "formal"
```
### 3. Performance Benchmarks

```yaml
# Include performance expectations
test_cases:
  - input: "Simple greeting"
    expected: "greet"
    context:
      max_response_time: 1.0
      max_tokens: 100

  - input: "Complex multi-intent request"
    expected: "multi_intent"
    context:
      max_response_time: 5.0
      max_tokens: 500
```