Inference

Definition

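Running a trained model on a new input to produce an output: a forward pass through the network with frozen weights. For an LLM, that means one forward pass per generated token. Per request it is far cheaper than training, which is why a model trained once can be served to many users through an API.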
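
As a minimal sketch of that definition, assume a made-up two-class linear classifier. The weights, biases, and the infer function below are hypothetical and not taken from any real model. Inference just reads the stored numbers and computes an output:

```python
import math

# Hypothetical frozen weights for a tiny 2-input, 2-class classifier.
# Tuples are used so the values cannot be modified in place.
WEIGHTS = ((0.5, -1.0), (-0.25, 0.75))
BIASES = (0.1, -0.1)


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer(x):
    """One forward pass: weighted sum plus bias per class, then softmax.

    The weights are only read here, never updated.
    """
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(WEIGHTS, BIASES)
    ]
    return softmax(logits)


probs = infer([1.0, 2.0])
print(probs)  # two class probabilities that sum to 1
```

Nothing in infer writes to WEIGHTS or BIASES, so the same input always gives the same output. Training would be the separate step that changes those numbers.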

Posts that use this term

Fine-tuning a model locally
When fine-tuning is actually the right call (it usually isn't) and how to pull off a LoRA run on a 16GB Mac, with a worked Llama 3.2 3B example.
Streaming, throughput, and the KV cache
Why TTFT and tok/s are different numbers, why streaming feels faster than it is, and the KV cache that makes the 1000th token cost about the same as the first.
Quantization, distillation, pruning: how a 140GB model fits on your laptop
Three ways to shrink an LLM, and why one of them does almost all the work. What Q4_K_M actually means and what each shortcut costs you.
The local-LLM vocabulary
Parameters, B, dense vs MoE, base vs instruct, tokens, context windows, chat templates, GGUF, and quant suffixes. Read it once and any HuggingFace model card stops being scary.
What leaves your machine when you use AI
What providers actually see, log, and keep when you call an LLM API in 2026. What "we don't train on your data" really means, how free and paid tiers differ, and when local is the only safe choice.
LLM API bills, and why a token costs what it costs
How input and output tokens get priced, why output runs 5-6x more, and how prompt caching cuts the input bill by 10x. Plus the hidden costs that ambush people.
The runtimes: llama.cpp, Ollama, LM Studio
llama.cpp is the engine. Ollama and LM Studio wrap it. What each one does, when to reach for which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.
Why Apple Silicon punches above its weight on local LLMs
Unified memory lets the GPU see all of RAM. Here's why that beats a discrete-GPU PC past 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.
What it takes to run a model on your own machine
Why VRAM is the one number that decides whether a local LLM runs, what quantization really does to a model file, and the hardware ladder from an 8GB laptop to a 192GB workstation.
The major LLMs in 2026
A field guide to the closed frontier models and the open weights you can actually run. What the "B" numbers mean, and which size fits your machine.
Where AI actually runs: cloud, local, edge
When you use AI, a model file is sitting on a real machine. There are only three places it can be, and which one decides almost everything else.
Prompt, RAG, fine-tune: three ways to shape a model
Three levers for shaping what an LLM does: prompting (ask better), RAG (give it the right context), fine-tuning (change the weights). What each costs, what each fixes, and how to pick.
How a model learns: training and inference
Training is the expensive one-time event where a model's numbers get tuned. Inference is the cheap repeated use afterwards. The gap in cost is enormous, and it shapes the whole industry.
Install llama.cpp
Build llama.cpp from source with Metal or CUDA, then run a GGUF model with llama-cli. The closest thing to bare-metal local inference.