AI Inference Scaling

These terms describe throughput, latency, and memory constraints that shape how AI systems are deployed.

How to recognize this theme

Terms used when optimizing model serving cost and speed.

In a daily board, this category groups terms by their shared role. Look for four cards that describe the same mechanism, risk area, or workflow rather than four words that merely sound similar.

Educational context

These entries are vocabulary notes for learning. They are not project endorsements, token recommendations, exchange rankings, or trading signals.

KV Cache

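A KV cache stores the key/value tensors already computed for prior tokens so an autoregressive model can generate each new token without recomputing them at every step. The new token still attends over the full history, but only its own keys and values have to be computed. The trade-off is memory: the cache grows with sequence length and batch size.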
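A rough size estimate can make the memory side of a KV cache concrete. The sketch below assumes a standard transformer layout with one key tensor and one value tensor per layer; the function name, parameter names, and the example model dimensions are illustrative and not taken from any specific model or library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes (illustrative formula).

    Assumes each layer stores one key and one value tensor of shape
    (batch_size, num_kv_heads, seq_len, head_dim).
    bytes_per_value=2 corresponds to 16-bit floats.
    """
    # Factor of 2: keys plus values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128,
# holding 4096 tokens for a single sequence in 16-bit precision.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

Because the estimate is linear in both seq_len and batch_size, doubling either one doubles the cache, which is why long contexts and large batches compete for the same memory.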

Token Throughput

Token throughput measures how many tokens a model server can process or generate per second under a given batch size, hardware, and latency target.
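As a small worked example of the arithmetic, throughput is total tokens divided by elapsed wall-clock time. The helper name and the batch numbers below are made up for illustration; real measurements depend on the hardware and latency target mentioned above.

```python
def tokens_per_second(total_tokens, elapsed_seconds):
    """Aggregate throughput: tokens handled per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_tokens / elapsed_seconds


# Hypothetical run: a batch of 8 requests, each generating 256 tokens,
# all finishing in 4 seconds.
batch_size = 8
tokens_per_request = 256
elapsed = 4.0

aggregate = tokens_per_second(batch_size * tokens_per_request, elapsed)
per_request = aggregate / batch_size

print(aggregate)    # 512.0 tokens/s across the whole server
print(per_request)  # 64.0 tokens/s as seen by one request
```

The gap between the two numbers is the reason batch size appears in the definition: larger batches usually raise aggregate throughput while each individual request may see a slower stream.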

Inference Latency

Inference latency is the delay between sending a request to a model and receiving output, shaped by queueing, batch scheduling, compute speed, and decoding strategy.
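One simplified way to reason about these components is an additive model: time waiting in the queue, time to process the prompt and emit the first token, then a fixed cost per additional output token. This is an assumption for illustration only; real servers overlap and batch these stages, and every name and number below is hypothetical.

```python
def estimate_latency_ms(queue_ms, prefill_ms, per_token_ms, output_tokens):
    """Return (time_to_first_token_ms, total_ms) under a simple additive model.

    Assumes the first output token arrives once queueing and prompt
    processing finish, and each later token costs per_token_ms.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    time_to_first_token = queue_ms + prefill_ms
    total = time_to_first_token + per_token_ms * (output_tokens - 1)
    return time_to_first_token, total


# Hypothetical request: 50 ms queued, 200 ms prompt processing,
# 20 ms per decoded token, 101 output tokens.
first, total = estimate_latency_ms(50, 200, 20, 101)
print(first, total)  # 250 2250
```

Separating the first-token delay from the total is useful because interactive applications often care most about the former, while batch jobs care about the latter.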

Context Window

A context window is the maximum number of tokens a model can consider at once, which limits how much prior conversation or document text can be used in one request.
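A common consequence is that long conversations have to be trimmed to fit. The sketch below keeps the most recent turns that fit a token budget, assuming that generated output counts against the same window as the prompt (typical, but not universal). The function name and token counts are hypothetical; real counts come from the model's own tokenizer.

```python
def trim_history(turn_token_counts, context_window, reserved_for_output):
    """Keep the newest conversation turns that fit in the context window.

    turn_token_counts: token count per turn, oldest first.
    reserved_for_output: tokens set aside for the model's reply.
    Returns (kept_counts_oldest_first, tokens_used).
    """
    budget = context_window - reserved_for_output
    if budget < 0:
        raise ValueError("reserved_for_output exceeds context_window")
    kept = []
    used = 0
    # Walk from newest to oldest, stopping at the first turn that does not fit.
    for count in reversed(turn_token_counts):
        if used + count > budget:
            break
        kept.append(count)
        used += count
    kept.reverse()
    return kept, used


# Hypothetical conversation of four turns against a 2048-token window,
# reserving 256 tokens for the reply (budget of 1792 tokens).
kept, used = trim_history([1200, 800, 500, 300], 2048, 256)
print(kept, used)  # [800, 500, 300] 1600
```

Dropping the oldest turns first is only one policy; summarizing older turns or retrieving relevant passages are alternatives that trade extra work for better recall.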