Keywords AI
Modern LLM systems rarely execute as a single model call.
Yet many observability setups still treat a chat completion as the primary or only unit of execution. This works for simple demos, but breaks down once you introduce tools, memory, retries, or agents.
To understand real LLM behavior, you need to distinguish log/event types, spans, and traces and understand how they fit together.
This post explains that execution model and how to reason about it in practice.

Most systems start with logs. You track things like:
These logs tell you what happened, but they don't tell you what kind of thing happened.
In real LLM systems, a single request can produce many different types of events:
When you flatten all of these into a single log stream, you lose the execution structure.
A useful observability model starts by typing events. This means categorizing what happened.
| Event type | What it represents |
|---|---|
| Model inference | A single LLM call |
| Tool call | Invocation of an external tool or function |
| Memory access | Reads/writes to state, cache, or vector stores |
| Control flow | Branches, retries, fallbacks |
| Error | Failures, timeouts, malformed outputs |
Typed events answer a simple question:
What kind of thing just happened?
This distinction is the foundation of an LLM observability data model. For a formal definition of these concepts, see the Observability Data Model.
A span is not a log entry.
A span represents a bounded unit of execution. Think of it like this:
A chat completion often contains multiple logical steps:
When you treat the entire request as a single span, you collapse all of this into a black box.
In practice, spans map to execution boundaries, not outputs:
Each span can emit multiple logs or events, but the span defines where execution begins and ends.
A trace is not a timeline of logs.
A trace is a group of related spans that together represent one request's execution.
This structure preserves important information:

In this trace:
It helps to be explicit about the role each concept plays:
| Concept | Answers |
|---|---|
| Log / Event | What happened? |
| Span | What unit of execution ran? |
| Trace | How did execution flow? |
Here's the thing: logs without spans lose boundaries. Spans without traces lose structure. Traces without typed events lose meaning.
Agent and workflow-based systems make this distinction unavoidable.
A single request might:

Once execution is modeled explicitly, things get clearer:
Instead of asking:
"Why was the output bad?"
You can ask:
"Which span failed or behaved differently?"
This shift is what makes observability actually useful.
The correct mental model is not:
one request → one completion → one log
It is:
one request → many typed events → bounded spans → one trace
This structure reflects how LLM systems actually run in production.
The full execution model, including event types, spans, traces, and relationships, is defined in the Observability Data Model documentation.
Chat completions are still important. They're just one event among many.
If you only log completions, LLM behavior will always feel opaque. But if you model execution with typed events, spans, and traces, failures stop being mysterious.
Observability isn't about collecting more logs. It's about modeling execution correctly.