What leaves your machine when you use AI
What providers actually see, log, and retain when you call an LLM API in 2026. What 'we don't train on your data' really means, free vs paid tier differences, and when local is the only safe option.

When you call a cloud LLM, every byte of your prompt leaves your machine. The provider sees it, their logs see it, and depending on the contract, their training pipeline might see it too. This post is about exactly what happens to those tokens after they leave you.
This is post 7 of 7 in the Running series. Six posts have been about how to run AI; this one is about who sees what when you do.
The shape of an API call

When your code calls messages.create() or chat.completions.create(), here's what physically happens:
- Your tokens are TLS-encrypted and sent to the provider's API endpoint.
- They land in the provider's load balancer, get routed to an inference cluster.
- The model runs, produces output tokens, streams them back.
- The full request and response are written to logs.
- Some of that data may persist for 30 days, 90 days, or longer, depending on the provider.
That logging step is where most of the privacy story actually lives.
What "we don't train on your data" actually means
Every major API provider says they don't train on your data by default. As of 2026:
- Anthropic. API customer data is not used to train Claude models. Inputs and outputs are deleted within 30 days of the last response, unless flagged for legal hold or trust-and-safety review.
- OpenAI. API data is not used to train models by default. Retained for up to 30 days for abuse monitoring; can be set to "zero data retention" for eligible enterprise customers.
- Google. Vertex AI and Gemini API data is not used to train models for paid tiers. Free tier (AI Studio) is different and may be used for service improvement.
Read the actual contracts when it matters. The defaults are reasonable, but "we don't train on your data" is a narrower claim than "we don't see your data". They see it. They log it. They have humans review flagged content for safety. Those humans are bound by NDAs but they exist.
There is no way to send data to a cloud API and have nobody, ever, see it under any circumstances. If your threat model requires that, you cannot use a cloud API. That is the entire reason local LLMs exist as a category.
The free-tier problem
The defaults are different for free consumer products vs paid APIs.
- ChatGPT free tier. OpenAI may use your conversations to improve models unless you opt out in settings. Most users don't opt out.
- Claude.ai free tier. Anthropic does not train on chat content by default, but the boundary is "consumer product" vs "API customer", and different terms apply.
- Gemini free / AI Studio. May be used for product improvement.
If you're pasting work code, internal data, or anything sensitive into a free chat tool, assume that data is in the provider's training pipeline candidate set. Opt out, or use the API.
The rough rule: paid API = "they see it, they don't train on it". Free chat = "they see it, they might train on it". Local model = "nobody sees it".
What gets logged
Across providers, here's what is typically logged on every API call:
- The full request body (prompt, tool definitions, all input).
- The full response body (output tokens, tool calls).
- Metadata: API key, IP address, timestamps, latency, token counts.
- Safety classifier scores on inputs and outputs.
- Any fine-tuning or batch job artifacts you uploaded.
Logs are usually retained for 30 days, sometimes longer for billing or abuse-investigation purposes. Enterprise contracts can negotiate this down, sometimes to zero retention.
A useful frame: assume anything you send to a cloud API is in their logs for at least a month. Plan accordingly.
When local is the only safe choice
There are three categories where I would not use a cloud API at all:
- Regulated PII. Health records under HIPAA, certain financial data under DPDP / GDPR, biometrics. Even with a BAA in place, the right answer is often "don't send it at all".
- Trade secrets and pre-disclosure work. Unpublished research, M&A documents, source code under NDA from a client. The contract probably allows it, but the operational risk of the provider being breached is non-zero.
- Anything you can't legally explain to an auditor. If you can't draw a clean diagram showing exactly where the data went, who saw it, and how long it stayed there, the cloud API isn't the right tool.
For everything else (writing emails, drafting code that will be public anyway, summarizing public docs, building features for users who already know AI is involved), the cloud API is fine. Use it.
The hybrid pattern
A lot of production systems run a tiered architecture by privacy level:
- Public, low-stakes content. Cloud API, frontier model. Best quality.
- Internal but non-sensitive. Cloud API with enterprise terms (zero retention, no training). Still cloud, but contractually tighter.
- Sensitive or regulated. Local model on owned infrastructure. Smaller, slower, but never leaves the building.
The trick is routing requests to the right tier automatically. The user typing into your product doesn't know or care which model handled their query. They just want the right answer with the right privacy guarantees. The routing logic is your problem.
This is also where the open-weights ecosystem earns its keep. Llama 4 70B running on a Mac Studio handles the sensitive tier; Claude Opus 4.7 handles the rest. Same product, two engines, picked per request.
What "encrypted in transit" doesn't cover
TLS protects your data on the wire. It does not protect:
- The server's process memory while inference runs.
- The provider's logging pipeline.
- The provider's database where logs persist.
- Backup snapshots of those databases.
- Subpoenas, warrants, or other legal compulsion against the provider.
Confidential computing (Intel SGX, AWS Nitro Enclaves) addresses some of these for specific workloads, but no major LLM API runs in a verifiable enclave end-to-end as of 2026. Apple Private Cloud Compute is the closest production system to that goal, and it's only used for Apple Intelligence, not for general developer APIs.
If "the provider's database admin can technically read this prompt" is unacceptable, you need a local model. There is no cloud API in 2026 that closes that gap.
The end-of-series take
Across this series, the main argument has been that AI in 2026 is plural, not singular. Cloud, local, and edge are different deployment shapes with different costs, capabilities, and privacy stories. The question is never "which model is best". It's "which model fits this request, on this hardware, under these constraints, for this user".
A working production system probably uses:
- A frontier API (Claude, GPT, Gemini) for the cases where capability dominates.
- A local model (Llama, Qwen) for high-volume work, sensitive data, or offline cases.
- An on-device model (Apple's Foundation Model, Gemini Nano) for latency-critical UX.
Build with all three in mind, route per request, and the bill, the privacy story, and the user experience all stay reasonable.
Where to go next
If this series got you to the point of running a model locally, the natural next step is the application layer: how to build with these models, how to wire them into agents, how to handle context and tools. That's a separate series, picking up wherever you stop.
For now: open the laptop, install Ollama, pull a 7B model, talk to it. Once a 4GB file on your SSD starts answering questions intelligently, the rest of the AI conversation looks different.
From the dictionary
Terms used in this post
Quick reference for the 17 terms you met above. Each one comes from the AI dictionary.
- Artificial IntelligenceAI
- Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
- e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- APIGeneral
- Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
- ChatGPTAI
- OpenAIs consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption — 100 million users in two months. The product most people mean when they say AI today.
- ClaudeAI
- Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
- e.g. This blog's create-post skill drafts inline using Claude.
- Context WindowNLP
- The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
- Edge AIAI
- Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
- Fine-TuningML
- Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- GeminiAI
- Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
- e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
- GPTAI
- OpenAIs family of large language models — Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
- InferenceML
- Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
- Large Language ModelAI
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- ModelML
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- OllamaAI
- A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- PromptNLP
- The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- TokenNLP
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- TrainingML
- The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- WeightsML
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
Rate this article
How helpful did you find this?
- 01
Where AI actually runs: cloud, local, edge
March 16, 2026
- 02
What leaves your machine when you use AI
March 31, 2026
- 03
LLM APIs and the economics of tokens
March 28, 2026
- 04
The runtimes: llama.cpp, Ollama, LM Studio
March 26, 2026
- 05
Why Apple Silicon punches above its weight on local LLMs
March 23, 2026
- 06
What it takes to run a model on your machine
March 21, 2026
- 07
The major LLMs in 2026
March 18, 2026
Newsletter
Get new articles in your inbox
AI engineering, LLM systems, and software architecture — no filler.
No spam. Unsubscribe any time.
Discussion
Comments
Leave a note about the article, architecture choices, or what you would build next.
Loading comments...