AI Inference Costs

These concepts describe cached prompts, grouped model work, token usage limits, and latency targets for AI systems.

How to recognize this theme

These terms apply when model calls must be measured for cost and responsiveness.

In a daily board, this category groups terms by their shared role. Look for four cards that describe the same mechanism, risk area, or workflow rather than four words that merely sound similar.

Educational context

These entries are vocabulary notes for learning. They are not project endorsements, token recommendations, exchange rankings, or trading signals.

Prompt Cache Hit

A prompt cache hit occurs when a request's prompt, or a leading part of it, matches content that was already processed, so the stored work is reused instead of being recomputed.
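As a rough illustration, a cache hit can be modeled as a lookup keyed by a hash of the prompt prefix. This is a minimal sketch, not how any particular provider implements caching: the PrefixCache class and its lookup method are hypothetical names, and a real system would store intermediate model state rather than a placeholder value.

```python
import hashlib


class PrefixCache:
    """Toy cache keyed by a hash of the prompt prefix (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, prefix: str) -> bool:
        """Return True on a cache hit; on a miss, record the prefix and return False."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return True
        # Placeholder for the saved work a real system would keep.
        self._store[key] = True
        self.misses += 1
        return False


cache = PrefixCache()
system_prompt = "You are a helpful assistant."
first = cache.lookup(system_prompt)   # miss: prefix seen for the first time
second = cache.lookup(system_prompt)  # hit: the same prefix can be reused
```

The second lookup is the hit: the same prefix was already recorded, so the repeated work could be skipped.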

Batch Inference Job

A batch inference job groups many model requests so they can be processed together, often to improve throughput or cost efficiency.
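The grouping step can be sketched as splitting a list of requests into fixed-size batches. This is an assumption-laden example: the make_batches helper and the batch size are hypothetical, and real batch services decide grouping and scheduling on their own terms.

```python
from typing import Iterator, List


def make_batches(requests: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size requests."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(requests), batch_size):
        yield requests[start:start + batch_size]


# Five placeholder requests grouped two at a time.
requests = [f"prompt-{i}" for i in range(5)]
batches = list(make_batches(requests, batch_size=2))
```

Here five requests become three batches, with the last batch holding the single leftover request.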

Token Budget Meter

A token budget meter tracks how many input and output tokens an AI workflow has consumed against the limit it is allowed to use.
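One way to picture such a meter is a counter with a fixed limit. The sketch below is illustrative only: the TokenBudgetMeter class, its method names, and the limit of 1000 tokens are all assumptions, not part of any real library.

```python
class TokenBudgetMeter:
    """Tracks tokens used against a fixed limit (hypothetical, for illustration)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add the tokens consumed by one model call to the running total."""
        self.used += input_tokens + output_tokens

    @property
    def remaining(self) -> int:
        """Tokens left before the limit is reached, never below zero."""
        return max(self.limit - self.used, 0)

    def can_spend(self, tokens: int) -> bool:
        """Return True if spending this many more tokens stays within the limit."""
        return self.used + tokens <= self.limit


meter = TokenBudgetMeter(limit=1000)
meter.record(input_tokens=300, output_tokens=200)
```

After this one call, half of the assumed budget is used, and a workflow could check can_spend before making the next request.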

Latency Service Level

A latency service level is a target for how quickly a model or AI service should respond under agreed conditions, often stated as a share of requests that must finish within a set time.
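A target of this kind can be checked by counting how many measured responses fall within a time threshold. The following is a minimal sketch under stated assumptions: the meets_latency_target function, the sample latencies, the 200 ms threshold, and the 90 percent requirement are all made up for illustration.

```python
from typing import Sequence


def meets_latency_target(
    latencies_ms: Sequence[float],
    threshold_ms: float,
    required_fraction: float,
) -> bool:
    """Return True if enough responses finished within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    within = sum(1 for value in latencies_ms if value <= threshold_ms)
    return within / len(latencies_ms) >= required_fraction


# Ten made-up response times in milliseconds; one is a slow outlier.
samples = [120, 180, 150, 900, 140, 160, 130, 170, 110, 190]
ok = meets_latency_target(samples, threshold_ms=200, required_fraction=0.9)
```

With these samples, nine of ten responses finish within the threshold, so the assumed 90 percent target is met despite the one slow response.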