AI Agents and Evaluation

Evaluation and controlled tool use help teams improve reliability without assuming the model is always correct or safe.

How to recognize this theme

How agents call tools and how teams measure and harden behavior.

In a daily board, this category groups terms by their shared role. Look for four cards that describe the same mechanism, risk area, or workflow rather than four words that merely sound similar.

Educational context

These entries are vocabulary notes for learning. They are not project endorsements, token recommendations, exchange rankings, or trading signals.

Function Calling

Function calling is a pattern where a model outputs a structured request to invoke a tool or API, separating reasoning from action execution and enabling more reliable integrations.

Tool Sandboxing

Tool sandboxing restricts what an agent's tools can access or modify (for example via allowlists, timeouts, and isolated environments) to reduce the blast radius of mistakes or attacks.

Evaluation Suite

An evaluation suite is a collection of repeatable tests and metrics used to measure a model or agent system across tasks, helping detect regressions and guide iteration.

Red Teaming

Red teaming is adversarial testing that probes a system for failure modes such as prompt injection, data leakage, or unsafe actions, with the goal of improving defenses and procedures.