How to get consistent and reproducible LLM outputs in 2025 (OpenAI, Gemini, Claude, vLLM...)

September 15, 2025

Reproducibility means: same prompt + same parameters → same output. In practice, cloud APIs and self-hosted stacks still show run-to-run drift due to sampling, batching/scheduling, and numeric quirks. A key insight from recent engineering work is batch invariance: making kernel results independent of batch size and scheduling collapses most of that drift (even at temperature=0).

The article Defeating Nondeterminism in LLM Inference explains which kernels matter (RMSNorm, matmul, attention) and shows measured trade-offs.


Why Outputs Drift Even with "Greedy" Settings

  • Sampling & ties: temperature=0 reduces variance but doesn't guarantee identical outputs in all stacks.
  • Provider & model updates: The "same" model name can point to a new snapshot.
  • Batching & server load: Different batch sizes or scheduling can change numeric paths—hence the push for batch-invariant kernels.
  • Numeric/hardware differences: Framework docs note that full determinism across devices/versions isn't guaranteed; deterministic algorithms just tighten it.
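
Before reaching for any particular knob, it helps to quantify how much drift you actually have. A minimal sketch, assuming the OpenAI client and the pinned model used in the example further down (any of the clients below works the same way): send the identical request several times and count the distinct completions.

Python (repeat-and-diff harness)

from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_outputs(n: int = 5) -> Counter:
    """Send the exact same request n times and tally distinct completions."""
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini-2025-07-18",   # pin the exact snapshot
            messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
            temperature=0,
            top_p=1,
            seed=42,
            max_tokens=300,
        )
        outputs.append(resp.choices[0].message.content)
    return Counter(outputs)

counts = sample_outputs()
print(f"{len(counts)} distinct outputs across {sum(counts.values())} runs")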

OpenAI (ChatGPT): "Mostly" Reproducible with seed

OpenAI's Chat Completions exposes a seed that helps repeat outputs when all inputs and parameters are identical (model snapshot, prompt bytes, temperature/top-p, penalties, tokens). It's a practical way to get "mostly deterministic" behavior.

Python (Chat Completions API)

from openai import OpenAI
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini-2025-07-18",  # pin the exact variant you deploy
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    temperature=0,
    top_p=1,
    seed=42,               # helps make outputs repeatable when all else is identical
    max_tokens=300,
    presence_penalty=0,
    frequency_penalty=0,
)
print(resp.choices[0].message.content)
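
Chat Completions responses also include a system_fingerprint, and the Cookbook's seed guidance suggests logging it: if the fingerprint changes between runs, the backend configuration changed and identical outputs shouldn't be expected even with the same seed. Continuing the snippet above (the field may be None for some models):

# Log the backend fingerprint alongside your outputs and compare across runs
print(resp.system_fingerprint)  # may be None for some models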

Google Gemini (Vertex AI): seed Is "Best Effort"

Gemini supports a seed, but docs say it's best-effort—deterministic output isn't guaranteed, and changing models or parameters can vary results even with the same seed.

Python (Vertex AI)

from vertexai import init
from vertexai.generative_models import GenerativeModel, GenerationConfig

init(project="YOUR_PROJECT", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

cfg = GenerationConfig(
    temperature=0,
    top_p=1,
    seed=42,               # best-effort repeatability
    max_output_tokens=300,
)

resp = model.generate_content("Explain CRDTs in 3 bullets.", generation_config=cfg)
print(resp.text)
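
Because the seed is best-effort, it also helps to pin a versioned model ID rather than the floating alias, so a silent model update can't masquerade as sampling noise. A small sketch, assuming the dated -002 version is available in your region:

# Prefer a pinned, versioned model ID over the floating "gemini-1.5-pro" alias
model = GenerativeModel("gemini-1.5-pro-002")
resp = model.generate_content("Explain CRDTs in 3 bullets.", generation_config=cfg)
print(resp.text)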

Anthropic Claude: No Official seed; Minimize Randomness

Claude exposes temperature, top_p, and top_k, but no official seed in the public API today. Minimize variance by pushing toward greedy decoding and holding everything else constant; pin exact model snapshots where your platform allows (e.g., Bedrock model IDs).

Python (Messages API)

from anthropic import Anthropic
client = Anthropic()

resp = client.messages.create(
    model="claude-3-7-sonnet-2025-05-xx",
    max_tokens=300,
    temperature=0,   # minimize randomness
    top_p=1,
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
)
print(resp.content[0].text)
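
If you run Claude through Amazon Bedrock, the snapshot is pinned in the model ID itself. A hedged sketch using boto3's bedrock-runtime client and the documented Anthropic messages body; the model ID below is illustrative, so check the exact IDs available in your region and account:

Python (Bedrock Messages via boto3)

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 300,
    "temperature": 0,   # minimize randomness; Bedrock exposes no seed either
    "top_p": 1,
    "messages": [{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
}

resp = client.invoke_model(
    modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",  # illustrative pinned ID
    body=json.dumps(body),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(resp["body"].read())["content"][0]["text"])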

vLLM (Self-Hosted): Seeds + Scheduling Controls

For local serving, vLLM provides a reproducibility guide. In current versions you typically (a) set per-request or global seeds and (b) turn off multiprocessing in V1 to make scheduling deterministic.

Python (engine + per-request seed)

import os
from vllm import LLM, SamplingParams

# Reduce scheduling nondeterminism in V1 engines:
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sp = SamplingParams(
    temperature=0,
    top_p=1,
    seed=42,                  # per-request seed
    max_tokens=300,
)
out = llm.generate(["Explain CRDTs in 3 bullets."], sampling_params=sp)
print(out[0].outputs[0].text)
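
You can also set the seed once at the engine level instead of per request; vLLM's LLM constructor accepts a seed engine argument, and per-request seeds like the one above still apply when set:

Python (engine-level seed)

# Engine-level seed; per-request SamplingParams seeds can still be passed as above
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", seed=42)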

Under concurrency, the strongest gains come from batch-invariant kernels (RMSNorm/matmul/attention), which remove batch-size dependencies at some throughput cost. See the Thinking Machines article for design details and evidence.


Hugging Face Transformers + PyTorch: Greedy + Deterministic Algorithms

Greedy decoding (do_sample=False) is deterministic when you keep the software/hardware stack fixed. Set framework seeds and opt into PyTorch's deterministic algorithms; PyTorch cautions that bit-identical results across devices/versions aren't promised.

Python

import os, random, numpy as np, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

# 1) Seed everything
SEED = 42
set_seed(SEED)
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# 2) Prefer deterministic algorithms (may reduce speed)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
# Optional (per PyTorch docs) for CUDA/cuBLAS:
# os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
mdl = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").eval()

inp = tok("Explain CRDTs in 3 bullets.", return_tensors="pt")
out_ids = mdl.generate(**inp, max_new_tokens=300, do_sample=False)  # greedy
print(tok.decode(out_ids[0], skip_special_tokens=True))
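
A quick way to confirm the stack really is behaving deterministically: run the same greedy generation twice and compare token IDs. Continuing the snippet above:

# Sanity check: with a fixed software/hardware stack, two greedy runs should match
out_ids_2 = mdl.generate(**inp, max_new_tokens=300, do_sample=False)
print("identical:", torch.equal(out_ids, out_ids_2))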

Ollama: Deterministic Generation with seed (Best-Effort)

Ollama's Modelfile and request options support a seed: "setting this to a specific number will make the model generate the same text for the same prompt." In practice, treat it as best-effort and keep temperature/top-p/top-k and context length constant; pin your build.

REST (JSON body options)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain CRDTs in 3 bullets.",
  "options": {
    "temperature": 0,
    "top_p": 1,
    "top_k": 1,
    "seed": 42
  }
}'
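
The same options can be set from Python with the ollama client package; a sketch, assuming the package is installed and the llama3 model has been pulled locally:

Python (ollama client)

import ollama

resp = ollama.generate(
    model="llama3",
    prompt="Explain CRDTs in 3 bullets.",
    options={"temperature": 0, "top_p": 1, "top_k": 1, "seed": 42},
)
print(resp["response"])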

llama.cpp / llama-cpp-python: Seed + Strict Greedy

llama.cpp and llama-cpp-python expose a seed; for strict greedy decoding recent guidance is to set a single top-k sampler with k=1 (don't rely on temp=0 alone). As with other stacks, different backends/builds can still cause minor drift—pin binaries, drivers, and quantization.

CLI (llama.cpp)

./llama-cli -m ./models/llama3.gguf \
  --seed 42 \
  --sampling-seq k --top-k 1 \
  --temp 0 \
  -p "Explain CRDTs in 3 bullets."

Python (llama-cpp-python)

from llama_cpp import Llama
llm = Llama(model_path="models/llama3.gguf", seed=42)

resp = llm.create_completion(
    prompt="Explain CRDTs in 3 bullets.",
    max_tokens=300,
    temperature=0,
    top_p=1,
    top_k=1,
)
print(resp["choices"][0]["text"])

NVIDIA TensorRT-LLM: Reuse the Same Engine + Fixed Inputs

At runtime, using the same built engine with the same inputs should be deterministic; however, engine building isn't deterministic (tactic selection can differ), and some performance features can lead to slightly different outputs under varying load. Reuse the exact engine file for repeatability.

Serve (conceptual example)

# Build once, reuse the same engine file for all runs:
trtllm-build --checkpoint_dir ./weights --gpt_attention_plugin float16 --output_dir ./engines
trtllm-serve --engine_dir ./engines --max_batch_size 1
# Then call your endpoint with temperature=0, top_p=1, fixed max tokens, etc.
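
Since the engine build itself isn't deterministic, treat the built engine as the artifact you pin. One practical habit (a generic sketch, not a TensorRT-LLM API): checksum the engine files and log the hash next to your outputs, so you can prove two runs used the same engine.

Python (engine fingerprint)

import hashlib
from pathlib import Path

def engine_fingerprint(engine_dir: str) -> str:
    """Hash every file in the engine directory into one stable fingerprint."""
    h = hashlib.sha256()
    for path in sorted(Path(engine_dir).rglob("*")):
        if path.is_file():
            h.update(path.name.encode())
            h.update(path.read_bytes())
    return h.hexdigest()

print(engine_fingerprint("./engines"))  # log this alongside each batch of outputs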

SGLang: Fast Server, Acknowledged Non-Determinism Under Load

SGLang is optimized for speed; its FAQ notes results may differ even at temperature=0 due to dynamic batching and prefix caching. If determinism matters, reduce concurrency features or run single-request batches; follow project issues for seed controls.

Launch (single-tenant style to reduce variance)

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 --port 8080 \
  --context-length 8192
  # keep load low; avoid dynamic batching if possible in your version
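
SGLang also exposes an OpenAI-compatible API, so one low-tech mitigation is to send requests strictly one at a time from a single client, giving dynamic batching nothing to combine. A sketch, assuming the server launched above serves the OpenAI-compatible routes and that no real API key is required locally:

Python (sequential OpenAI-compatible client)

from openai import OpenAI

# Point the standard client at the local SGLang server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

for i in range(3):  # sequential requests, no client-side concurrency
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
        temperature=0,
        max_tokens=300,
    )
    print(f"run {i}:", resp.choices[0].message.content[:60])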

Hugging Face TGI (Text Generation Inference): Client-Side Control

TGI serves Transformers models with high throughput. Determinism mainly comes from your client's generation params (e.g., greedy vs sampling) and a stable container/model snapshot; TGI itself doesn't promise determinism under batching. Keep decoding params fixed and pin images/tags.

cURL (TGI native generate route)

curl http://localhost:8080/generate \
  -X POST -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain CRDTs in 3 bullets.",
    "parameters": {
      "do_sample": false,   // greedy
      "max_new_tokens": 300
    }
  }'
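
If you prefer a typed client over raw cURL, huggingface_hub's InferenceClient speaks TGI's generate route; keep the same greedy parameters fixed and pin the container image tag you deploy.

Python (huggingface_hub InferenceClient)

from huggingface_hub import InferenceClient

# Point the client at your TGI instance (pin the container image tag you deploy)
client = InferenceClient("http://localhost:8080")

text = client.text_generation(
    "Explain CRDTs in 3 bullets.",
    do_sample=False,        # greedy
    max_new_tokens=300,
)
print(text)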

Further Reading

  • Defeating Nondeterminism in LLM Inference — batch-invariant kernels and why batching causes drift.
  • OpenAI Cookbook: Reproducible outputs with seed — how to use seed and keep other params fixed.
  • Google Vertex AI: seed is best-effort — helps, but not a guarantee.
  • vLLM Reproducibility — seeds + scheduling controls for self-hosted.
  • PyTorch Reproducibility Notes — what deterministic algorithms do (and don't) guarantee.
  • Ollama Modelfile Reference — seed option and other knobs.
  • llama-cpp-python API / llama.cpp guidance — seed parameter and strict greedy setup.
  • TensorRT-LLM & TensorRT forums — reuse the same engine; building isn't deterministic.
  • SGLang FAQ — why results can differ even at temperature=0.