March 31, 20266 min read

What leaves your machine when you use AI

What providers actually see, log, and keep when you call an LLM API in 2026. What "we don't train on your data" really means, how free and paid tiers differ, and when local is the only safe choice.

Hit send on a cloud LLM call and every byte of your prompt walks out the door. The provider sees it. Their logs see it. And depending on what you signed, their training pipeline might see it too. This post is about what happens to those tokens the moment they leave you.

This is post 7 of 7 in the AI Running series, the last one. The other posts were about how to run AI. This one is about who's watching when you do.

The shape of an API call

API call lifecycle

When your code calls messages.create() or chat.completions.create(), here's what physically happens:

Your tokens are TLS-encrypted and sent to the provider's API endpoint.
They land in the provider's load balancer, get routed to an inference cluster.
The model runs, produces output tokens, streams them back.
The full request and response are written to logs.
Some of that data may persist for 30 days, 90 days, or longer, depending on the provider.

Step 4 is where the privacy story actually lives. Everyone fixates on the wire. The logs are the part that bites you.

What "we don't train on your data" actually means

Every major API provider says they don't train on your data by default. As of 2026:

Anthropic. API customer data is not used to train Claude models. Inputs and outputs are deleted within 30 days of the last response, unless flagged for legal hold or trust-and-safety review.
OpenAI. API data is not used to train models by default. Retained for up to 30 days for abuse monitoring; can be set to "zero data retention" for eligible enterprise customers.
Google. Vertex AI and Gemini API data is not used to train models for paid tiers. The free tier (AI Studio) is different and may be used for service improvement.

Read the actual contracts when it matters. The defaults are reasonable. But "we don't train on your data" is a much narrower promise than "we don't see your data." They see it. They log it. They have humans review flagged content for safety. Those humans sign NDAs, but they're real people and they exist.

There is no way to send data to a cloud API and have nobody, ever, see it under any circumstances. If your threat model needs that, a cloud API is off the table. That's the whole reason local LLMs exist as a category.

The free-tier problem

The defaults for free consumer products are not the defaults for paid APIs. This trips people up constantly.

ChatGPT free tier. OpenAI may use your conversations to improve models unless you opt out in settings. Most users never open settings.
Claude.ai free tier. Anthropic does not train on chat content by default, but the boundary here is "consumer product" versus "API customer," and different terms apply.
Gemini free / AI Studio. May be used for product improvement.

Paste work code, internal data, or anything sensitive into a free chat tool and you should assume it's now a candidate for the training pipeline. Opt out, or use the API.

The rough rule. Paid API: they see it, they don't train on it. Free chat: they see it, they might train on it. Local model: nobody sees it.

What gets logged

Across providers, here's what typically lands in the logs on every API call:

The full request body (prompt, tool definitions, all input).
The full response body (output tokens, tool calls).
Metadata: API key, IP address, timestamps, latency, token counts.
Safety classifier scores on inputs and outputs.
Any fine-tuning or batch job artifacts you uploaded.

Logs usually stick around for 30 days, sometimes longer for billing or abuse investigations. Enterprise contracts can negotiate this down, sometimes all the way to zero retention.

A useful frame: assume anything you send to a cloud API is in their logs for at least a month, and plan around that.

When local is the only safe choice

There are three cases where I won't touch a cloud API:

Regulated PII. Health records under HIPAA, certain financial data under DPDP / GDPR, biometrics. Even with a BAA in place, the right answer is usually "don't send it at all."
Trade secrets and pre-disclosure work. Unpublished research, M&A documents, source code under NDA from a client. The contract probably allows it, but the operational risk of a provider breach is never zero.
Anything you can't legally explain to an auditor. If you can't draw a clean diagram of where the data went, who saw it, and how long it stayed, the cloud API is the wrong tool.

Everything else is fine. Writing emails, drafting code that's going public anyway, summarizing public docs, building features for users who already know AI is in the loop. Use the cloud API and don't overthink it.

The hybrid pattern

A lot of production systems split traffic into tiers by privacy level:

Public, low-stakes content. Cloud API, frontier model. Best quality.
Internal but non-sensitive. Cloud API with enterprise terms (zero retention, no training). Still cloud, just contractually tighter.
Sensitive or regulated. Local model on owned infrastructure. Smaller, slower, never leaves the building.

The trick is routing each request to the right tier automatically. The person typing into your product doesn't know or care which model answered them. They want the right answer with the right privacy guarantee. The routing is your problem, not theirs.

This is also where the open-weights ecosystem pays for itself. Llama 4 70B on a Mac Studio handles the sensitive tier; Claude Opus 4.7 handles the rest. Same product, two engines, picked per request.

What "encrypted in transit" doesn't cover

TLS protects your data on the wire. It does not protect:

The server's process memory while inference runs.
The provider's logging pipeline.
The provider's database where logs persist.
Backup snapshots of those databases.
Subpoenas, warrants, or other legal compulsion against the provider.

Confidential computing (Intel SGX, AWS Nitro Enclaves) covers some of these for specific workloads, but no major LLM API runs in a verifiable enclave end to end as of 2026. Apple Private Cloud Compute is the closest production system to that goal, and it's only used for Apple Intelligence, not for general developer APIs.

So if "the provider's database admin can technically read this prompt" is unacceptable to you, you need a local model. No cloud API in 2026 closes that gap.

The take that ties the series together

The thread running through these posts is simple. AI in 2026 is plural, not singular. Cloud, local, and edge are different deployment shapes with different costs, capabilities, and privacy stories. The question is never "which model is best." It's "which model fits this request, on this hardware, under these constraints, for this user."

A working production system probably runs:

A frontier API (Claude, GPT, Gemini) for the cases where capability wins.
A local model (Llama, Qwen) for high-volume work, sensitive data, or offline cases.
An on-device model (Apple's Foundation Model, Gemini Nano) for latency-critical UX.

Build with all three in mind, route per request, and the bill, the privacy story, and the user experience all stay sane.

So next time you're about to paste something into a chat box, ask the one question that matters. Where does this leave my machine, and who reads it after? If the honest answer makes you wince, that's what local is for. Open the laptop, install Ollama, pull a 7B model, talk to it. That's exactly where the hands-on Local LLMs series picks up.

AI AI Running API LLM Local Models Privacy

From the dictionary

Terms used in this post

Quick reference for the 17 terms you met above. Each one comes from the AI dictionary.

Artificial IntelligenceAI: Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.; e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
APIGeneral: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPTAI: OpenAIs consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption — 100 million users in two months. The product most people mean when they say AI today.
ClaudeAI: Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.; e.g. This blog's create-post skill drafts inline using Claude.
Context WindowNLP: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
Edge AIAI: Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
Fine-TuningML: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GeminiAI: Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.; e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GPTAI: OpenAIs family of large language models — Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
InferenceML: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.; e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
PromptNLP: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML: The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
WeightsML: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Rate this article

How helpful did you find this?

Series

AI Running

7 / 7 posts

Browse all in AI Running →

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Loading comments...

On this page

From the dictionary

Terms in this post

Artificial IntelligenceAI: Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
APIGeneral: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPTAI: OpenAIs consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption — 100 million users in two months. The product most people mean when they say AI today.
ClaudeAI: Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
Context WindowNLP: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
Edge AIAI: Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
Fine-TuningML: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GeminiAI: Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
GPTAI: OpenAIs family of large language models — Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
InferenceML: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
PromptNLP: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML: The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
WeightsML: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Series

AI Running

7 / 7 posts

Browse all in AI Running →