
Inference

The process of running an AI model to generate output from input. Every AI response is an inference call.

Inference is the phase where a trained AI model processes input and generates output. Training teaches the model patterns from data; inference applies those patterns to new inputs. Every time you send a message to Claude, GPT, or Gemini and get a response, that is an inference call.

Inference has practical implications for builders: it costs money (API pricing is typically metered per token processed, so every call has a price), it takes time (latency ranges from milliseconds to minutes depending on the model and output length), and it can fail (rate limits, timeouts, model errors). Designing your app to handle these gracefully is part of building AI features.
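One common way to handle transient inference failures is retrying with exponential backoff. The sketch below is illustrative, not tied to any specific provider: `call_with_retries` and `flaky_call` are hypothetical names, and the exception types stand in for whatever your API client actually raises.

```python
import random
import time

def call_with_retries(call, max_attempts=4, base_delay=0.5):
    """Retry a flaky inference call with exponential backoff and jitter.

    `call` is any zero-argument function that performs the API request;
    TimeoutError/ConnectionError stand in for your client's error types.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with a little jitter avoids retry bursts.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a simulated call that times out twice before succeeding.
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "model output"

result = call_with_retries(flaky_call, base_delay=0.01)
print(result)
```

Backoff like this handles rate limits and timeouts; permanent errors (for example, an invalid request) should fail fast rather than retry.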

Key concepts: latency (often measured as time to first token), throughput (tokens generated per second), streaming (displaying output as it is generated rather than waiting for completion), and batching (grouping multiple requests so they are processed together efficiently).
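These metrics can be measured directly while consuming a streamed response. A minimal sketch, using a fake token generator in place of a real streaming API (the function names `stream_tokens` and `consume_stream` are hypothetical):

```python
import time

def stream_tokens(tokens, delay=0.01):
    """Stand-in for a streaming inference API: yields tokens one at a time."""
    for tok in tokens:
        time.sleep(delay)  # simulates per-token generation time
        yield tok

def consume_stream(token_iter):
    """Collect tokens as they arrive, measuring latency and throughput."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for tok in token_iter:
        if first_token_at is None:
            # Time to first token: how long the user waits before seeing anything.
            first_token_at = time.perf_counter() - start
        pieces.append(tok)
    elapsed = time.perf_counter() - start
    throughput = len(pieces) / elapsed  # tokens per second
    return "".join(pieces), first_token_at, throughput

output, ttft, tps = consume_stream(stream_tokens(["Hel", "lo", ", ", "world"]))
print(output)
```

With streaming, time to first token matters more to perceived responsiveness than total completion time, which is why chat UIs render tokens as they arrive.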
