AI Testing Frameworks: DeepEval, Ragas, LangSmith, CI Integration
Introduction
Testing AI applications is fundamentally different from testing traditional software. LLM outputs are non-deterministic, there is no single correct answer, and failures manifest as subtle quality degradations rather than crashes. Dedicated AI testing frameworks address these challenges with automated evaluation metrics, test case generation, and regression detection.
DeepEval
DeepEval is an open-source testing framework designed for LLM applications:
from deepeval import assert_test
from deepeval.metrics import (
    HallucinationMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    FaithfulnessMetric,
    BiasMetric,
    ToxicityMetric,
)
from deepeval.test_case import LLMTestCase

def test_rag_response_no_hallucination():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris, located in the Île-de-France region.",
        # HallucinationMetric evaluates against `context`;
        # `retrieval_context` is used by retrieval metrics such as FaithfulnessMetric
        context=["Paris is the capital and most populous city of France."],
    )
    hallucination_metric = HallucinationMetric(threshold=0.3)
    assert_test(test_case, [hallucination_metric])

def test_response_relevancy():
    test_case = LLMTestCase(
        input="Explain Kubernetes pods",
        actual_output="Kubernetes is a container orchestration platform...",
        retrieval_context=[
            "A pod is the smallest deployable unit in Kubernetes.",
            "Pods can contain one or more containers.",
        ],
    )
    relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [relevancy_metric])

# Run tests: deepeval test run test_ai.py
DeepEval supports 15+ evaluation metrics including hallucination detection, answer relevancy, faithfulness, contextual precision, bias detection, and toxicity scoring. Each metric returns a score (0-1) that can be compared against a configurable threshold.
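The score-versus-threshold mechanic can be sketched independently of any framework. The helper below is a hypothetical illustration of how such comparisons work (`MetricResult` and `passes` are not part of DeepEval's API); note that for "lower is better" metrics such as hallucination, the threshold acts as a maximum rather than a minimum:

```python
# Hypothetical sketch of gating a metric score against a threshold.
# MetricResult, passes, and the sample scores are illustrative only.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    score: float        # normalized to the 0-1 range by the framework
    threshold: float    # configured per metric
    higher_is_better: bool = True  # hallucination/toxicity-style metrics invert this

def passes(result: MetricResult) -> bool:
    # Higher-is-better metrics must reach the threshold;
    # lower-is-better metrics must stay under it.
    if result.higher_is_better:
        return result.score >= result.threshold
    return result.score <= result.threshold

results = [
    MetricResult("answer_relevancy", score=0.82, threshold=0.7),
    MetricResult("hallucination", score=0.15, threshold=0.3, higher_is_better=False),
]
assert all(passes(r) for r in results)
```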
Ragas
Ragas is specialized for evaluating RAG pipelines end-to-end:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
test_data = Dataset.from_dict({
    "question": [
        "What is a Kubernetes pod?",
        "How does load balancing work?",
    ],
    "answer": [
        "A pod is the smallest deployable unit...",
        "Load balancing distributes traffic...",
    ],
    "contexts": [
        ["Pods can contain one or more containers."],
        ["Load balancers distribute incoming traffic."],
    ],
    "ground_truth": [
        "A pod is the smallest deployable unit in Kubernetes.",
        "Load balancing distributes network traffic across servers.",
    ],
})

# Compute RAG metrics
result = evaluate(
    test_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print(result)
# {
#     "faithfulness": 0.92,
#     "answer_relevancy": 0.88,
#     "context_precision": 0.95,
#     "context_recall": 0.85
# }
Ragas decomposes RAG quality into four independent metrics: faithfulness (is the answer grounded in context?), answer relevancy (does the answer address the question?), context precision (are retrieved documents relevant?), and context recall (are all relevant documents retrieved?).
LangSmith
LangSmith provides a hosted evaluation platform with tracing and annotation:
from langsmith import Client, evaluate
from langsmith.schemas import Example, Run

client = Client()

# Define a custom evaluator
def answer_correctness(run: Run, example: Example) -> dict:
    # Compare model output against expected output
    predicted = run.outputs.get("output", "")
    expected = example.outputs.get("answer", "")
    # compute_similarity is a placeholder -- e.g. an LLM-as-judge call
    return {"score": compute_similarity(predicted, expected)}

# Run evaluation on a dataset
results = evaluate(
    lambda inputs: my_llm_chain(inputs["question"]),
    data="my-test-dataset",
    evaluators=[answer_correctness],
    experiment_prefix="rag-v2-eval",
)

# View results in LangSmith dashboard
print(results)
LangSmith excels at trace-level evaluation. Every LLM call, retrieval step, and prompt template is recorded and can be compared across experiments.
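Comparing experiments ultimately means comparing aggregate metric scores against a baseline. The sketch below shows one hypothetical way to flag regressions locally (LangSmith performs this comparison in its dashboard; the dict-based report format here is an assumption, not its API):

```python
# Hypothetical cross-experiment regression check.
# Experiment scores are plain dicts here; LangSmith stores these server-side.
def detect_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    # Flag any metric that dropped by more than `tolerance` versus the baseline;
    # a metric missing from the candidate counts as a score of 0.0.
    return [
        name for name, base_score in baseline.items()
        if candidate.get(name, 0.0) < base_score - tolerance
    ]

baseline = {"answer_correctness": 0.91, "faithfulness": 0.88}
candidate = {"answer_correctness": 0.92, "faithfulness": 0.81}
print(detect_regressions(baseline, candidate))  # ['faithfulness']
```

A small tolerance matters because LLM-judged scores are noisy: flagging every 0.01 dip produces false alarms, while a tolerance around a few points catches genuine degradations.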
CI Integration
Integrate AI tests into your CI pipeline to catch regressions automatically:
# .github/workflows/ai-tests.yml
name: AI Evaluation Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install deepeval ragas langsmith
      # The metric thresholds are the quality gate: any metric below its
      # threshold fails assert_test, which fails this step and blocks the merge.
      - name: Run AI evaluation tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          deepeval test run tests/ai_evaluation.py
Set a quality gate that blocks merges when metrics fall below thresholds. This prevents gradual quality degradation that is invisible in traditional tests.
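If your evaluation step writes scores to a report file instead of asserting inline, the gate can be a small standalone script. This is a hypothetical sketch (the flat `{metric: score}` report schema is an assumption, not an artifact any of these frameworks emits):

```python
# Hypothetical CI quality gate over a metric-score report.
# The report schema ({metric_name: score}) is assumed for illustration.
def check_quality_gate(scores: dict[str, float], min_threshold: float) -> bool:
    """Return True when every metric clears the threshold; print any failures."""
    failures = {m: s for m, s in scores.items() if s < min_threshold}
    for metric, score in failures.items():
        print(f"FAIL {metric}: {score:.2f} < {min_threshold:.2f}")
    return not failures

# In CI you would load the scores (e.g. json.load on the report file) and
# sys.exit(0 if check_quality_gate(scores, 0.7) else 1) to fail the job.
print(check_quality_gate({"faithfulness": 0.92, "answer_relevancy": 0.65}, 0.7))
```

Exiting nonzero is what makes the CI step, and therefore the merge check, fail.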
Conclusion
AI testing requires specialized frameworks that understand non-deterministic outputs. Use DeepEval for unit-test-style evaluation with configurable metrics. Use Ragas for end-to-end RAG pipeline evaluation. Use LangSmith for trace-level debugging and comparison across experiments. Integrate all three into CI pipelines with automated quality gates to maintain production AI quality over time.