What leaves your machine when you use AI
What providers actually see, log, and retain when you call an LLM API in 2026. What 'we don't train on your data' really means, free vs paid tier differences, and when local is the only safe option.
Blog
Posts on AI engineering, LLM systems, and software development.
How input and output tokens are priced, why output costs 5-6x more, what prompt caching saves you (10x), and the hidden costs (tokenizer drift, reasoning tokens, tool-call loops) that surprise people.
llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.
Unified memory means the GPU sees all of RAM. Why that beats discrete-GPU PCs for models above 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.
Why VRAM is the hard ceiling on local LLMs, what quantization actually does to a model file, and the practical hardware ladder from 8GB laptops to 192GB workstations.
A tour of the closed frontier models (Claude, GPT, Gemini) and the open-weight models (Llama, Qwen, DeepSeek, Mistral). What 'B' means, what each is good at, and which size to actually run.