AI Testing Frameworks: DeepEval, Ragas, LangSmith, CI Integration


Introduction





Testing AI applications is fundamentally different from testing traditional software. LLM outputs are non-deterministic, there is no single correct answer, and failures manifest as subtle quality degradations rather than crashes. Dedicated AI testing frameworks address these challenges with automated evaluation metrics, test case generation, and regression detection.





DeepEval





DeepEval is an open-source testing framework designed for LLM applications:






from deepeval import assert_test
from deepeval.metrics import (
    HallucinationMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    FaithfulnessMetric,
    BiasMetric,
    ToxicityMetric,
)
from deepeval.test_case import LLMTestCase


def test_rag_response_no_hallucination():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris, located in the Ile-de-France region.",
        # HallucinationMetric evaluates the output against the provided context
        context=["Paris is the capital and most populous city of France."],
    )

    # Passes when the hallucination score stays at or below the threshold
    hallucination_metric = HallucinationMetric(threshold=0.3)
    assert_test(test_case, [hallucination_metric])


def test_response_relevancy():
    test_case = LLMTestCase(
        input="Explain Kubernetes pods",
        actual_output="Kubernetes is a container orchestration platform...",
        retrieval_context=[
            "A pod is the smallest deployable unit in Kubernetes.",
            "Pods can contain one or more containers.",
        ],
    )

    # Passes when the relevancy score meets or exceeds the threshold
    relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [relevancy_metric])


# Run tests: deepeval test run test_ai.py







DeepEval supports 15+ evaluation metrics including hallucination detection, answer relevancy, faithfulness, contextual precision, bias detection, and toxicity scoring. Each metric produces a score between 0 and 1 that is compared against a configurable threshold: a minimum for most metrics, and a maximum for metrics where lower is better, such as hallucination and toxicity.
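
Metrics can also be run standalone, outside of assert_test, to inspect the raw score and the generated reasoning. A minimal sketch, assuming an OPENAI_API_KEY is configured for the default judge model:


from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)

# measure() scores the test case without asserting, which is useful for
# ad-hoc inspection or custom reporting
metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score)            # numeric score between 0 and 1
print(metric.reason)           # natural-language justification for the score
print(metric.is_successful())  # True if the score clears the threshold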





Ragas





Ragas specializes in end-to-end evaluation of RAG pipelines:






from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
test_data = Dataset.from_dict({
    "question": [
        "What is a Kubernetes pod?",
        "How does load balancing work?",
    ],
    "answer": [
        "A pod is the smallest deployable unit...",
        "Load balancing distributes traffic...",
    ],
    "contexts": [
        ["Pods can contain one or more containers."],
        ["Load balancers distribute incoming traffic."],
    ],
    "ground_truth": [
        "A pod is the smallest deployable unit in Kubernetes.",
        "Load balancing distributes network traffic across servers.",
    ],
})

# Compute RAG metrics
result = evaluate(
    test_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(result)
# {
#     "faithfulness": 0.92,
#     "answer_relevancy": 0.88,
#     "context_precision": 0.95,
#     "context_recall": 0.85
# }







Ragas decomposes RAG quality into four independent metrics: faithfulness (is the answer grounded in context?), answer relevancy (does the answer address the question?), context precision (are retrieved documents relevant?), and context recall (are all relevant documents retrieved?).
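
Aggregate scores hide which questions are responsible for a weak metric. One way to drill down, sketched below under the assumption that result.to_pandas() carries the dataset columns alongside the per-sample metric scores (the 0.7 floor is arbitrary):


# Per-sample breakdown: one row per question, one column per metric
df = result.to_pandas()
print(df[["question", "faithfulness", "context_recall"]])

# Flag individual samples that fall below a chosen floor
weak = df[df["faithfulness"] < 0.7]
if not weak.empty:
    print(f"{len(weak)} sample(s) below the faithfulness floor:")
    print(weak["question"].tolist())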





LangSmith





LangSmith provides a hosted evaluation platform with tracing and annotation:






from langsmith import Client, evaluate
from langsmith.schemas import Example, Run

client = Client()  # used to manage datasets and experiments


# Define a custom evaluator
def answer_correctness(run: Run, example: Example) -> dict:
    # Compare model output against the expected output stored in the dataset
    predicted = run.outputs.get("output", "")
    expected = example.outputs.get("answer", "")

    # compute_similarity is a placeholder for your own scoring logic,
    # e.g. an LLM-as-judge call or an embedding similarity measure
    return {"key": "answer_correctness", "score": compute_similarity(predicted, expected)}


# Run evaluation on a dataset
results = evaluate(
    lambda inputs: my_llm_chain(inputs["question"]),  # the application under test
    data="my-test-dataset",
    evaluators=[answer_correctness],
    experiment_prefix="rag-v2-eval",
)

# View results in the LangSmith dashboard
print(results)







LangSmith excels at trace-level evaluation. Every LLM call, retrieval step, and prompt template is recorded and can be compared across experiments.
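
Tracing requires instrumenting the application: LangChain components are traced automatically once the tracing environment variables are set, and plain Python functions can be wrapped with the traceable decorator. A minimal sketch, where retrieve_docs, generate_answer, and rag_pipeline are hypothetical stand-ins for your own pipeline steps:


import os
from langsmith import traceable

# Enable tracing for this process; the project name here is an assumption
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "rag-v2"


@traceable(run_type="retriever")
def retrieve_docs(question: str) -> list[str]:
    # Placeholder retrieval step; each call appears as a child run in the trace
    return ["Pods can contain one or more containers."]


@traceable(run_type="llm")
def generate_answer(question: str, docs: list[str]) -> str:
    # Placeholder generation step; swap in your real LLM call
    return f"Answer to {question!r} grounded in {len(docs)} document(s)."


@traceable
def rag_pipeline(question: str) -> str:
    docs = retrieve_docs(question)
    return generate_answer(question, docs)


rag_pipeline("Explain Kubernetes pods")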





CI Integration





Integrate AI tests into your CI pipeline to catch regressions automatically:






# .github/workflows/ai-tests.yml
name: AI Evaluation Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install deepeval ragas langsmith

      # The thresholds configured in the tests are the quality gate: if any
      # metric falls below its threshold, this step fails and blocks the merge
      - name: Run AI evaluation tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          deepeval test run tests/ai_evaluation.py







The thresholds configured on each metric act as the quality gate: when a score falls below its threshold, assert_test fails, the CI step fails, and the merge is blocked. This prevents the gradual quality degradation that is invisible to traditional tests.
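
DeepEval's per-metric thresholds fail the test run on their own, but frameworks that only report aggregate scores (such as the Ragas run above) need an explicit gate. A minimal sketch of such a gate as a standalone script, with a hypothetical file name and illustrative threshold values:


# scripts/check_quality_gate.py (hypothetical helper invoked from the workflow)
import sys

# Aggregate scores produced earlier, e.g. dumped to JSON by the Ragas run
scores = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_recall": 0.85}

# Per-metric floors; tune these to your own baseline
thresholds = {"faithfulness": 0.90, "answer_relevancy": 0.80, "context_recall": 0.80}

failures = [
    f"{name}: {scores.get(name, 0.0):.2f} < {floor:.2f}"
    for name, floor in thresholds.items()
    if scores.get(name, 0.0) < floor
]

if failures:
    print("Quality gate failed:\n" + "\n".join(failures))
    sys.exit(1)  # non-zero exit fails the CI step and blocks the merge

print("Quality gate passed")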





Conclusion





AI testing requires specialized frameworks that understand non-deterministic outputs. Use DeepEval for unit-test-style evaluation with configurable metrics. Use Ragas for end-to-end RAG pipeline evaluation. Use LangSmith for trace-level debugging and comparison across experiments. Integrate all three into CI pipelines with automated quality gates to maintain production AI quality over time.