About Course

In this course, you’ll build an AI agent, add observability to visualize and debug its steps, and evaluate its performance component-wise.

In detail, you’ll: 

  • Distinguish between evaluating LLM-based systems and traditional software testing.
  • Explore the basic structure of AI agents – routers, skills, and memory – and implement an AI agent from scratch.
  • Add observability to the agent by collecting traces of its steps and visualizing them.
  • Choose the appropriate evaluator – code-based, LLM-as-a-Judge, or human annotation – for each component of the agent.
  • Set up evaluations for the example agent's skills and router decisions using code-based and LLM-as-a-Judge evaluators: create test examples from collected traces and prepare detailed prompts for the LLM-as-a-Judge.
  • Compute a convergence score to evaluate whether the example agent can respond to a query in an efficient number of steps (see the sketch after this list).
  • Run structured experiments to improve the agent's performance by exploring changes to the prompt, the LLM, or the agent's logic.
  • Understand how to deploy these evaluation techniques to monitor the agent’s performance in production.
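To give a feel for the convergence metric mentioned above, here is a minimal sketch. It assumes convergence is defined as the ratio of the optimal (minimum observed) step count to the average step count across runs of the same query; the function name and this exact definition are illustrative, not necessarily the course's implementation.

```python
# Minimal convergence-score sketch (illustrative; assumes the score is
# optimal step count / average step count across runs of the same query).
def convergence_score(step_counts: list[int]) -> float:
    """Return a value in (0, 1]; 1.0 means every run took the minimal path."""
    if not step_counts:
        raise ValueError("need at least one run")
    optimal = min(step_counts)                      # fewest steps observed
    average = sum(step_counts) / len(step_counts)   # mean steps per run
    return optimal / average

# Example: three runs of the same query took 3, 5, and 4 agent steps.
print(convergence_score([3, 5, 4]))  # 0.75
```

A score near 1.0 suggests the agent reliably takes an efficient path; a low score suggests it wanders through unnecessary steps on some runs.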

By the end of this course, you’ll know how to trace AI agents, systematically evaluate them, and improve their performance. 


What Will You Learn?

  • Learn how to add observability to your agent to gain insight into its steps and debug it.
  • Learn how to set up evaluations for the agent's components by preparing test examples, choosing the appropriate evaluator (code-based or LLM-as-a-Judge), and identifying the right metrics (see the sketch after this list).
  • Learn how to structure your evaluation into experiments to iterate on and improve the output quality and the path taken by your agent.
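As a rough illustration of the LLM-as-a-Judge idea referenced above, the sketch below asks a judge model to grade an agent's answer for relevance. The prompt, model name, and label scheme are placeholder assumptions for illustration, not the course's exact setup.

```python
# Rough LLM-as-a-Judge sketch (prompt, model, and labels are placeholders).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: "relevant" or "irrelevant"."""

def judge_relevance(question: str, answer: str) -> str:
    """Ask the judge model for a single relevance label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip().lower()

# Example usage on a single question/answer pair pulled from a collected trace.
label = judge_relevance("What were Q3 sales?", "Q3 sales were $1.2M, up 8% QoQ.")
print(label)  # expected: "relevant"
```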