Testing & Evals
@gemstack/ai-sdk ships three layers for proving an agent works:
AiFakeswaps the registered provider with a programmable mock, so you can test agent wiring with no API key and no network.- Observers let you subscribe to agent lifecycle events for tracing and metrics.
- The eval framework (
@gemstack/ai-sdk/eval) runs a suite of input cases plus assertions against real models, to prove the agent does the right thing, not just that it runs.
See also: Agents, Tools, Providers.
Testing with AiFake
AiFake.fake() replaces every registered provider with a mock and sets a default model, so your agent code runs unchanged with no real provider. Call .restore() afterward to put the real registry back, so tests do not leak between cases.
import { AiFake } from '@gemstack/ai-sdk'
const fake = AiFake.fake()
fake.respondWith('Mocked response')
const response = await new MyAgent().prompt('Hello')
// response.text === 'Mocked response'
fake.restore()respondWith(text) returns the same text for every call. The default before you set anything is 'fake response'.
Scripting a multi-step loop
When the agent loops (the model returns tool calls, then text, then more tool calls), script each step with respondWithSequence(...). Step N answers the agent's Nth provider call:
fake.respondWithSequence([
{ toolCalls: [{ id: 't1', name: 'lookup', arguments: { id: 42 } }] },
{ text: 'The answer is 42.' },
])When a step sets toolCalls, the finishReason defaults to 'tool_calls'; a text-only step defaults to 'stop'. Once the sequence is exhausted, later calls fall back to the respondWith default.
Forcing failures
failOnStep(stepIndex, error) throws on a specific iteration, to exercise failover and error paths. It is independent of the response sequence, so the order in which you call them does not matter:
fake.respondWithSequence([
{ toolCalls: [{ id: 't1', name: 'lookup', arguments: {} }] },
{ text: 'recovered' },
])
fake.failOnStep(0, new Error('Rate limited')) // first call throws; second succeedsFaking the other capabilities
The fake also covers the non-chat surfaces, each with a matching respondWith*:
| Method | Fakes |
|---|---|
respondWithImage(base64) | image generation |
respondWithAudio(buffer) | text-to-speech |
respondWithTranscription(text) | speech-to-text |
respondWithEmbedding(vectors) | embeddings |
respondWithRanking(results) | reranking |
respondWithFileUpload(result) | file uploads |
respondWithFileSearchResults(opts) | hosted file-search results |
Asserting on what was sent
After a run, assert on what the agent actually sent the provider:
const fake = AiFake.fake()
fake.respondWith('hi')
await new MyAgent().prompt('Hello there')
fake.assertPrompted(input => input.includes('Hello'))
fake.restore()The assertion helpers each take an optional predicate: assertPrompted, assertNothingPrompted, assertImageGenerated, assertAudioGenerated, assertTranscribed, assertEmbedded, assertReranked, and assertFileUploaded.
To make stray prompts a hard error (no ambient respondWith default), chain preventStrayPrompts() and script every expected call explicitly:
const fake = AiFake.fake().preventStrayPrompts()
fake.respondWithSequence([{ text: 'expected reply' }])
// any call beyond the scripted sequence throws instead of returning a defaultObservability
Subscribe to agent lifecycle events through the observer registry on the @gemstack/ai-sdk/observers subpath. This is the same surface a tracing or metrics collector hooks into:
import { aiObservers } from '@gemstack/ai-sdk/observers'
const unsubscribe = aiObservers.subscribe(event => {
if (event.kind === 'agent.step.completed') {
console.log(`step ${event.iteration}: ${event.tokens.total} cumulative tokens`)
}
if (event.kind === 'agent.completed') {
console.log(`done in ${event.duration}ms, ${event.steps.length} steps`)
}
})Event kinds:
agent.step.completedfires after each loop iteration, with that step's tools called, finish reason, and cumulative usage. Useful for streaming progress to a UI without waiting for the full run.agent.completedfires once after a successful run, with the full step history and final usage.agent.failedfires once, witherrorset, when the run throws or aborts.agent.eval.completedfires per eval case (see below), so collectors can aggregate pass-rate over time.
Each step's toolCalls[] carries a duration field (wall-clock milliseconds spent in the tool handler), so you can attribute latency to specific tools.
Evals against real models
AiFake proves the wiring works. Evals prove the agent does the right thing on real models. Define a suite of input cases and assertions, then run it. Eval suites use the same Agent instances as your app, so there is one source of truth.
// evals/support-agent.eval.ts
import { evalSuite, llmJudge, exactMatch, regex } from '@gemstack/ai-sdk/eval'
import { SupportAgent } from '../app/Agents/SupportAgent.js'
export default evalSuite('SupportAgent', {
agent: () => new SupportAgent(),
cases: [
{ name: 'password reset', input: 'How do I reset my password?',
assert: llmJudge('mentions a password reset link or email') },
{ name: 'price', input: 'How much?',
assert: exactMatch('$99/month') },
{ name: 'support email', input: 'Contact?',
assert: regex(/support@example\.com/) },
],
})Running a suite programmatically
evalSuite(...) returns a suite object; run it with runSuite(suite). It resolves to a SuiteReport with per-case pass/fail, score, tokens, cost, and duration. Pair it with a reporter:
import { runSuite, reportConsole, reportJson, reportHtml } from '@gemstack/ai-sdk/eval'
import suite from './evals/support-agent.eval.js'
const report = await runSuite(suite)
reportConsole(report) // human-readable summary to console
const envelope = reportJson(report) // CI-friendly JSON envelope
const html = reportHtml(report) // self-contained HTML stringDrive your own pass/fail gate off the report (for example, exit non-zero when report.failed > 0) so evals can run in CI without any framework CLI. reportConsole returns the report unchanged, so you can chain it inline.
Built-in metrics
An assertion is a Metric: (response, ctx) => MetricResult, where MetricResult is { pass, score?, reason? }. Sync or async both work. The built-ins:
exactMatch(string)andregex(RegExp)are surface checks onresponse.text.llmJudge(criterion, opts?)uses a small-model judge for fuzzy "did the answer mention X?" assertions.jsonShape(zodSchema)is a strict structural assertion: it strips code fences and runs zodsafeParse, surfacing the failing path.semanticMatch(reference, opts?)embeds the reference and response and compares by cosine similarity againstopts.threshold(default0.85). Requires a registered provider with embeddings.tokenCost(threshold)passes whenresponse.usage.totalTokens <= threshold, to catch prompt-size regressions.compose(...metrics)runs metrics in order with first-failure short-circuit, for examplecompose(jsonShape(Schema), tokenCost(800)).
User metrics are first-class: any (response, ctx) => MetricResult qualifies. The ctx carries the case input and caseName so a custom metric can log or branch on them.
Eval observability
runSuite emits an agent.eval.completed observer event after every case (including skipped ones), so a metrics collector subscribed to aiObservers can aggregate pass-rate per (suite, case) over time, exactly like the lifecycle events above.
The
evalSuite/runSuite/ metrics / reporters API documented here is the programmatic engine surface. The Rudder framework adds anai:evalCLI on top of it for record/replay and discovery; that command lives in@rudderjs/ai, not in this engine.