Blog

Engineering

Posts on AI engineering, LLM systems, and software development.


Local LLMs · May 15, 2026 · #13

Troubleshooting local LLMs (and how to keep up after this series)

The full catalog of local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, bad RAG hits, tool-call hallucination. Plus where to follow the field once you're on your own.

AI, Debugging, LLM, Local LLMs, Troubleshooting

Read →

Local LLMs · May 12, 2026 · #12

Fine-tuning a model locally

When fine-tuning is actually the right call (it usually isn't) and how to pull off a LoRA run on a 16GB Mac, with a worked Llama 3.2 3B example.

AI, Fine Tuning, LLM, Local LLMs, LoRA

Read →

Local LLMs · May 8, 2026 · #11

Local agents and tool use

Function calling on open models in 2026. Which ones actually work, why local agents break when they break, and the scaffolding that keeps them upright.

Agents, AI, Function Calling, LLM, Local LLMs

Read →

Local LLMs · May 5, 2026 · #10

Local RAG and embeddings

Build a working local RAG pipeline in about 30 lines using nomic-embed-text, Chroma, and Llama 3.2. And why running it on your own machine beats the cloud for personal notes.

AI, Chroma, Embeddings, LLM, Local LLMs

Read →

Local LLMs · May 1, 2026 · #9

Wiring a local LLM into the tools you already use

How to point VS Code (Continue, Cline), web chat UIs (Open WebUI, LibreChat, Page Assist), and your own code at a local model using the OpenAI-compatible API. Swap cloud for local without rewriting anything.

AI, Integration, LLM, Local LLMs, VS Code

Read →

Local LLMs · April 28, 2026 · #8

Your first local LLM, start to finish

Install Ollama, pull Llama 3.2 3B, chat with it, hit its API, and fix the five things that break on a first install. You finish with a working local LLM.

AI, LLM, Local LLMs, Ollama, Setup

Read →

Local LLMs · April 24, 2026 · #7

Every machine can run a local LLM (here's what fits)

A per-tier guide to running local LLMs in 2026, from 8GB integrated graphics to a 192GB Mac Studio. Specific models, specific speeds, specific configs.

AI, Hardware, LLM, Local LLMs, VRAM

Read →

Local LLMs · April 21, 2026 · #6

System requirements by OS for local LLMs

What macOS, Linux, and Windows each need before you run a local LLM in 2026. Mac is the smoothest, Linux gives you the most knobs, and native Windows finally just works.

AI, Linux, LLM, Local LLMs, macOS

Read →

Local LLMs · April 17, 2026 · #5

Picking a local model by task

The 2026 open leaders, sorted by what you actually want to do: coding, chat, the small-model crowd, structured output, vision, embeddings, and audio.

AI, Coding, LLM, Local LLMs, Model Selection

Read →

Local LLMs · April 14, 2026 · #4

Streaming, throughput, and the KV cache

Why TTFT and tok/s are different numbers, why streaming feels faster than it is, and the KV cache that makes the 1000th token cost about the same as the first.

AI, Inference, KV Cache, LLM, Local LLMs

Read →

Local LLMs · April 10, 2026 · #3

Quantization, distillation, pruning: how a 140GB model fits on your laptop

Three ways to shrink an LLM, and why one of them does almost all the work. What Q4_K_M actually means and what each shortcut costs you.

AI, Distillation, GGUF, LLM, Local LLMs

Read →

Local LLMs · April 7, 2026 · #2

The local-LLM vocabulary

Parameters, B, dense vs MoE, base vs instruct, tokens, context windows, chat templates, GGUF, and quant suffixes. Read it once and any Hugging Face model card stops being scary.

AI, GGUF, LLM, Local LLMs, Vocabulary

Read →