Reproducibility means: same prompt + same parameters → same output. In practice, cloud APIs and self-hosted stacks still show run-to-run drift due to sampling, batching/scheduling, and numeric quirks. A key insight from recent engineering work is batch invariance: making kernel results independent of batch size and scheduling, which collapses most of that drift (even at temperature=0).
The Thinking Machines article Defeating Nondeterminism in LLM Inference explains which kernels matter (RMSNorm, matmul, attention) and shows measured trade-offs.
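The root cause is that floating-point reductions aren't associative, so a kernel that changes its reduction order with batch size can return slightly different numbers for the same input row. A minimal PyTorch sketch (shapes and sizes are illustrative, not from the article) makes the effect visible:
Python (illustrative sketch)
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
batch = torch.cat([x, torch.randn(63, 4096)], dim=0)  # same row plus 63 others

solo = x @ W                 # row processed at batch size 1
in_batch = (batch @ W)[0:1]  # same row processed at batch size 64

# Different batch sizes can select different kernels/reduction orders,
# so the results may not be bitwise identical even though the math is the same.
print(torch.equal(solo, in_batch))           # often False on GPU; may be True on CPU
print((solo - in_batch).abs().max().item())  # typically a tiny but nonzero difference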
OpenAI's Chat Completions exposes a seed that helps repeat outputs when all inputs and parameters are identical (model snapshot, prompt bytes, temperature/top-p, penalties, max tokens). Determinism still isn't guaranteed; the response's system_fingerprint identifies the backend configuration, so log it to spot backend changes that break repeatability. It's a practical way to get "mostly deterministic" behavior.
Python (Chat Completions API)
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",  # pin the exact snapshot you deploy
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    temperature=0,
    top_p=1,
    seed=42,  # helps make outputs repeatable when all else is identical
    max_tokens=300,
    presence_penalty=0,
    frequency_penalty=0,
)
print(resp.choices[0].message.content)
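To verify repeatability in practice, run the same pinned request twice and compare both the text and the returned system_fingerprint; if the fingerprint changes, the backend changed and identical outputs are no longer expected. A small sketch reusing the client above (the helper name is ours):
Python (repeatability check)
def run_once():
    r = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
        temperature=0, top_p=1, seed=42, max_tokens=300,
    )
    # system_fingerprint identifies the backend configuration; if it differs
    # between runs, identical outputs aren't expected even with the same seed.
    return r.choices[0].message.content, r.system_fingerprint

out1, fp1 = run_once()
out2, fp2 = run_once()
print("same backend:", fp1 == fp2)
print("same output: ", out1 == out2)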
Gemini supports a seed, but docs say it's best-effort—deterministic output isn't guaranteed, and changing models or parameters can vary results even with the same seed.
Python (Vertex AI)
from vertexai import init
from vertexai.generative_models import GenerativeModel, GenerationConfig

init(project="YOUR_PROJECT", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")
cfg = GenerationConfig(
    temperature=0,
    top_p=1,
    seed=42,  # best-effort repeatability
    max_output_tokens=300,
)
resp = model.generate_content("Explain CRDTs in 3 bullets.", generation_config=cfg)
print(resp.text)
Claude exposes temperature, top_p, and top_k, but no official seed in the public API today. Minimize variance by pushing toward greedy decoding and holding everything else constant; pin exact model snapshots where your platform allows (e.g., Bedrock model IDs).
Python (Messages API)
from anthropic import Anthropic

client = Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # pin an exact dated snapshot
    max_tokens=300,
    temperature=0,  # minimize randomness
    top_p=1,
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
)
print(resp.content[0].text)
For local serving, vLLM provides a reproducibility guide. In current versions you typically (a) set per-request or global seeds and (b) turn off multiprocessing in V1 to make scheduling deterministic.
Python (engine + per-request seed)
import os

# Reduce scheduling nondeterminism in V1 engines (set before importing vLLM):
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sp = SamplingParams(
    temperature=0,
    top_p=1,
    seed=42,  # per-request seed
    max_tokens=300,
)
out = llm.generate(["Explain CRDTs in 3 bullets."], sampling_params=sp)
print(out[0].outputs[0].text)
Under concurrency, the strongest gains come from batch-invariant kernels (RMSNorm/matmul/attention), which remove batch-size dependencies at some throughput cost. See the Thinking Machines article for design details and evidence.
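A quick way to see whether batching affects your deployment is to generate the same prompt once on its own and once inside a mixed batch, then diff the outputs. A rough sketch reusing the llm and sp objects from the snippet above (the filler prompts are arbitrary):
Python (batch-sensitivity check)
prompt = "Explain CRDTs in 3 bullets."
solo = llm.generate([prompt], sampling_params=sp)[0].outputs[0].text

# Same prompt, now scheduled alongside other requests in one batch.
mixed = [prompt] + [f"Write a haiku about {t}." for t in ("rivers", "compilers", "autumn")]
batched = llm.generate(mixed, sampling_params=sp)[0].outputs[0].text

# If these differ, batching/scheduling (not sampling) is the source of drift.
print("batch-invariant for this prompt:", solo == batched)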
Greedy decoding (do_sample=False) is deterministic when you keep the software/hardware stack fixed. Set framework seeds and opt into PyTorch's deterministic algorithms; PyTorch cautions that bit-identical results across devices/versions aren't promised.
Python
import os
import random

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

# 1) Seed everything
SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# 2) Prefer deterministic algorithms (may reduce speed)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
# Optional (per PyTorch docs) for CUDA/cuBLAS:
# os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
mdl = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").eval()

inp = tok("Explain CRDTs in 3 bullets.", return_tensors="pt")
out_ids = mdl.generate(**inp, max_new_tokens=300, do_sample=False)  # greedy
print(tok.decode(out_ids[0], skip_special_tokens=True))
Ollama's Modelfile/options supports a seed: "setting this to a specific number will make the model generate the same text for the same prompt." In practice, treat it as best-effort, keep temperature/top-p/top-k and context length constant, and pin your build.
REST (JSON body options)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain CRDTs in 3 bullets.",
  "options": {
    "temperature": 0,
    "top_p": 1,
    "top_k": 1,
    "seed": 42
  }
}'
llama.cpp and llama-cpp-python expose a seed; for strict greedy decoding recent guidance is to set a single top-k sampler with k=1 (don't rely on temp=0 alone). As with other stacks, different backends/builds can still cause minor drift—pin binaries, drivers, and quantization.
CLI (llama.cpp)
./llama-cli -m ./models/llama3.gguf \
  --seed 42 \
  --sampling-seq k --top-k 1 \
  --temp 0 \
  -p "Explain CRDTs in 3 bullets."
Python (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(model_path="models/llama3.gguf", seed=42)
resp = llm.create_completion(
    prompt="Explain CRDTs in 3 bullets.",
    max_tokens=300,
    temperature=0,
    top_p=1,
    top_k=1,
)
print(resp["choices"][0]["text"])
With TensorRT-LLM, running the same built engine on the same inputs should be deterministic at runtime; however, engine building isn't deterministic (tactic selection can differ), and some performance features can lead to slightly different outputs under varying load. Reuse the exact engine file for repeatability.
Serve (conceptual example)
# Build once, reuse the same engine file for all runs:
trtllm-build --checkpoint_dir ./weights --gpt_attention_plugin float16 --output_dir ./engines

trtllm-serve --engine_dir ./engines --max_batch_size 1
# Then call your endpoint with temperature=0, top_p=1, fixed max tokens, etc.
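Since builds aren't reproducible, treat the built engine as a pinned artifact: record a checksum once and verify it before every serving run. A minimal sketch (the directory and file extension are illustrative):
Python (engine checksum)
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file in chunks so large engine files don't load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Log these alongside your eval results; a changed hash means a different engine.
for engine in Path("./engines").glob("*.engine"):
    print(engine.name, sha256(engine))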
SGLang is optimized for speed; its FAQ notes results may differ even at temperature=0 due to dynamic batching and prefix caching. If determinism matters, reduce concurrency features or run single-request batches; follow project issues for seed controls.
Launch (single-tenant style to reduce variance)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 --port 8080 \
  --context-length 8192
# keep load low; avoid dynamic batching if possible in your version
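On the client side, keep every request greedy and identical. If your SGLang version exposes the OpenAI-compatible route (recent releases do), a request against the server launched above could look like the sketch below; the base URL matches that launch command, and the model name assumes the server registers the model under its path:
Python (OpenAI-compatible client)
from openai import OpenAI

# Local SGLang server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain CRDTs in 3 bullets."}],
    temperature=0,
    top_p=1,
    max_tokens=300,
)
print(resp.choices[0].message.content)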
TGI serves Transformers models with high throughput. Determinism mainly comes from your client's generation params (e.g., greedy vs sampling) and a stable container/model snapshot; TGI itself doesn't promise determinism under batching. Keep decoding params fixed and pin images/tags.
cURL (TGI native /generate route)
# do_sample=false selects greedy decoding
curl http://localhost:8080/generate \
  -X POST -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain CRDTs in 3 bullets.",
    "parameters": {
      "do_sample": false,
      "max_new_tokens": 300
    }
  }'
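To confirm the setup actually repeats, fire the same greedy request twice and diff the generated text. A small sketch assuming the server from the curl example is listening on localhost:8080:
Python (repeatability check)
import requests

payload = {
    "inputs": "Explain CRDTs in 3 bullets.",
    "parameters": {"do_sample": False, "max_new_tokens": 300},
}
# TGI's /generate route returns {"generated_text": "..."} for non-streaming requests.
texts = [
    requests.post("http://localhost:8080/generate", json=payload, timeout=120)
    .json()["generated_text"]
    for _ in range(2)
]
print("identical across runs:", texts[0] == texts[1])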