LM Studio

/dictionary/lm-studio

Definition

A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.

Related terms

Posts that use this term

Troubleshooting local LLMs (and how to keep up after this series)
The full catalog of local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, bad RAG hits, tool-call hallucination. Plus where to follow the field once you're on your own.
Local agents and tool use
Function calling on open models in 2026. Which ones actually work, why local agents break when they break, and the scaffolding that keeps them upright.
Wiring a local LLM into the tools you already use
How to point VS Code (Continue, Cline), web chat UIs (Open WebUI, LibreChat, Page Assist), and your own code at a local model using the OpenAI-compatible API. Swap cloud for local without rewriting anything.
Every machine can run a local LLM (here's what fits)
A per-tier guide to running local LLMs in 2026, from 8GB integrated graphics to a 192GB Mac Studio. Specific models, specific speeds, specific configs.
System requirements by OS for local LLMs
What macOS, Linux, and Windows each need before you run a local LLM in 2026. Mac is the smoothest, Linux gives you the most knobs, and native Windows finally just works.
Streaming, throughput, and the KV cache
Why TTFT and tok/s are different numbers, why streaming feels faster than it is, and the KV cache that makes the 1000th token cost about the same as the first.
The local-LLM vocabulary
Parameters, B, dense vs MoE, base vs instruct, tokens, context windows, chat templates, GGUF, and quant suffixes. Read it once and any HuggingFace model card stops being scary.
The runtimes: llama.cpp, Ollama, LM Studio
llama.cpp is the engine. Ollama and LM Studio wrap it. What each one does, when to reach for which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.
Why Apple Silicon punches above its weight on local LLMs
Unified memory lets the GPU see all of RAM. Here's why that beats a discrete-GPU PC past 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.
What it takes to run a model on your own machine
Why VRAM is the one number that decides whether a local LLM runs, what quantization really does to a model file, and the hardware ladder from an 8GB laptop to a 192GB workstation.
Where AI actually runs: cloud, local, edge
When you use AI, a model file is sitting on a real machine. There are only three places it can be, and which one decides almost everything else.
Install the OpenAI SDK
Install the OpenAI SDK for Python and Node, set your API key, and prove it works with a one-line chat call.
Install LM Studio
Install LM Studio on macOS, Linux, and Windows, then flip on the local OpenAI-compatible server so any client library can talk to a model on your own machine.
Install llama.cpp
Build llama.cpp from source with Metal or CUDA, then run a GGUF model with llama-cli. The closest thing to bare-metal local inference.