April 21, 2026 · 6 min read

System requirements by OS for local LLMs

What macOS, Linux, and Windows each need before you run a local LLM in 2026. Mac is the smoothest, Linux gives you the most knobs, and native Windows finally just works.

A local LLM runs on every modern operating system. What changes from one to the next is what each one wants from your hardware and what you type to install it. So this post is the checklist: what Mac, Linux, and Windows each need before you pull your first model.

This is post 6 of 13 in the Local LLMs series. By the end you'll know which boxes your specific machine ticks, and what to install before we go further.

The bar is lower than you think

[Figure: Per-OS install paths]

Whatever OS you're on, here's the floor:

8 GB RAM, ideally 16 GB. Under 8 and you're stuck with the smallest models (Phi-3 mini, Llama 3.2 1B). At 16, the practical universe opens up.
20 GB free disk. Models are big files. A starter set of three runs about 10 GB, and you'll keep adding.
64-bit OS, modern processor. Nothing from the last 8 years is too old. Any M-series Mac, AMD Ryzen 3000+, Intel Core 8th gen and up, any recent AMD/Intel desktop. All fine.
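If you'd rather check the floor from a terminal than dig through system settings, here's a minimal sketch for macOS and Linux (it also works inside WSL2). The function names check_floor, detect_ram_gb, and detect_free_disk_gb are made up for this post, and the thresholds are just the 8 GB, 16 GB, and 20 GB numbers from the list above.

```shell
# Classify a machine against the floor described above.
# Usage: check_floor <ram_gb> <free_disk_gb>
check_floor() {
    ram_gb=$1
    disk_gb=$2
    if [ "$disk_gb" -lt 20 ]; then
        echo "low-disk"        # under the 20 GB free-disk floor
    elif [ "$ram_gb" -lt 8 ]; then
        echo "tiny-models"     # under 8 GB: smallest models only
    elif [ "$ram_gb" -lt 16 ]; then
        echo "minimum"         # 8 to 15 GB: meets the floor
    else
        echo "comfortable"     # 16 GB and up
    fi
}

# Total RAM rounded to whole GB (Linux reads /proc/meminfo, macOS asks sysctl).
detect_ram_gb() {
    case "$(uname -s)" in
        Linux)  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 + 0.5 }' /proc/meminfo ;;
        Darwin) sysctl -n hw.memsize | awk '{ printf "%d\n", $1 / 1024 / 1024 / 1024 + 0.5 }' ;;
        *)      echo 0 ;;      # unknown platform: report 0 rather than guess
    esac
}

# Free disk space in whole GB on the filesystem that holds $HOME.
detect_free_disk_gb() {
    df -Pk "$HOME" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Example:
#   check_floor "$(detect_ram_gb)" "$(detect_free_disk_gb)"
```

The classifier takes plain numbers, so you can also feed it the specs of a machine you're thinking of buying.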

Here's the honest version. If your laptop runs a current browser without choking, it can run a local LLM. The minimums aren't the wall people picture.

macOS, the easy road

This is the path of least resistance in 2026. Apple Silicon's unified memory means the GPU can see all of your RAM, which the Running series covered in detail. What a Mac wants:

M1 or newer. The old Intel Macs do work, but they're 5 to 10 times slower. Not worth chasing for this.
macOS 13 (Ventura) or newer. Most runtimes target 14+ by now.
Memory. 8GB is the bare minimum and limits you to 1B to 3B models. 16GB is the comfortable floor. 32GB and up is where it gets fun.
Backend: Metal. Apple's GPU programming model. Every major local-LLM runtime speaks it natively.
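The first two items on that list can be checked in one line each. This is a rough sketch, not an official compatibility test: mac_ok is a hypothetical helper name, and the macOS 13 cutoff is simply the one quoted above. One caveat: uname -m reports x86_64 when your terminal itself runs under Rosetta, even on Apple Silicon.

```shell
# Usage: mac_ok <arch> <macos_version>   e.g. mac_ok arm64 14.5
mac_ok() {
    arch=$1
    major=${2%%.*}             # keep the major version: "14.5" -> "14"
    if [ "$arch" != "arm64" ]; then
        echo "intel-mac"       # works, but expect it to be much slower
    elif [ "$major" -lt 13 ]; then
        echo "upgrade-macos"   # Apple Silicon, but older than Ventura
    else
        echo "ready"
    fi
}

# Example on a real Mac:
#   mac_ok "$(uname -m)" "$(sw_vers -productVersion)"
```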

What you install:

# install homebrew if you don't have it (one-liner from brew.sh)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

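# install ollama via homebrew

brew install ollama

# start the ollama server in the background (homebrew doesn't start it for you)

brew services start ollama

That's the whole thing. No driver, no kernel module, no CUDA installer. Done.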

For LM Studio, grab the .dmg from lmstudio.ai and drag it to Applications.

One note on the M-series. M3 and M4 beat M2 beats M1, sure, but the unified-memory advantage matters more than the generation. A 16GB M1 will outrun a 16GB Intel Mac with a discrete GPU by a wide margin. If you've got any M-series Mac, you're in great shape.

Linux, the most knobs

The most flexible OS for this, and the one most production deployments actually run on. You get three GPU paths plus a fallback:

NVIDIA GPU + CUDA. The mainstream route. RTX 30, 40, 50 series consumer cards, and Tesla / A / H series datacenter cards. CUDA 12.1+ recommended.
AMD GPU + ROCm. AMD's answer to CUDA. It got a lot better through 2025 and 2026, but it's still rougher than CUDA. RX 7900 XT/XTX, MI300 datacenter cards. Set HSA_OVERRIDE_GFX_VERSION for older cards.
Vulkan backend. Cross-vendor GPU compute. Works on AMD, Intel Arc, even integrated Intel/AMD graphics. Slower than CUDA or ROCm, but it runs on hardware nothing else will touch.
CPU only. Always there. 2 to 8 tok/s on a 7B model. Fine for a quick test, painful for real work.
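The list above is really a preference order: CUDA if you can, then ROCm, then Vulkan, then CPU. Here's a small sketch of that decision. pick_backend and has_cmd are names invented for this post, and checking whether nvidia-smi, rocminfo, or vulkaninfo is on your PATH is only a heuristic. A tool can be installed while the driver behind it is broken.

```shell
# Usage: pick_backend <has_nvidia> <has_rocm> <has_vulkan>   (1 = present, 0 = absent)
pick_backend() {
    if [ "$1" = 1 ]; then
        echo cuda
    elif [ "$2" = 1 ]; then
        echo rocm
    elif [ "$3" = 1 ]; then
        echo vulkan
    else
        echo cpu               # the fallback that is always there
    fi
}

# Print 1 if a command is on PATH, else 0.
has_cmd() {
    if command -v "$1" >/dev/null 2>&1; then echo 1; else echo 0; fi
}

# Example:
#   pick_backend "$(has_cmd nvidia-smi)" "$(has_cmd rocminfo)" "$(has_cmd vulkaninfo)"
```

Ollama and LM Studio make this choice for you. The sketch is only useful for knowing what to expect before you install.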

NVIDIA on Ubuntu is the smoothest combination. If you're picking fresh, pick that.

What you install (Ubuntu 22.04+ on NVIDIA):

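# install nvidia driver if not already (550 is one example branch;
# "ubuntu-drivers devices" lists the one recommended for your card)

sudo apt update

sudo apt install nvidia-driver-550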

# install ollama

curl -fsSL https://ollama.com/install.sh | sh

Ollama's installer detects CUDA, pulls the right binary, and registers a systemd service. After that, ollama list works and you can start pulling models.

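For ROCm, you'll want AMD's amdgpu-install script plus llama.cpp built with its HIP backend enabled. Current llama.cpp releases use the CMake option -DGGML_HIP=ON; older guides show the retired LLAMA_HIPBLAS=1 make flag. Or just use LM Studio's Linux build, which handles ROCm internally.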

Windows, which finally just works

The OS most engineers actually use, and the one most local-LLM tutorials quietly skip. In 2026 there are two paths that work. Which one's right depends on how much you live in a terminal.

Path 1: native Windows (recommended for most)

The fastest, simplest setup. Both Ollama and LM Studio ship native Windows binaries.

What you install:

NVIDIA driver (any recent one). RTX 30/40/50 cards work out of the box.
Ollama for Windows. Download from ollama.com, run the installer. It runs in the background as a tray app and starts when you sign in.
LM Studio for Windows. Download the .exe from lmstudio.ai.

Both detect your GPU on their own. No WSL, no Linux subsystem, no compiling anything. From a clean Windows 11 install you can have a model talking back to you in 10 minutes.

# verify the install (PowerShell)

ollama run llama3.2:3b

The single best surprise of 2025 for Windows users: native local LLMs now run as smoothly as they do on macOS.

Path 2: WSL2 + Ubuntu (for Linux power users on Windows)

If you basically live in Linux tools and just happen to be on a Windows box, WSL2 is the move. The NVIDIA driver you install on the Windows side exposes the GPU inside WSL2 automatically, so don't install a second Linux NVIDIA driver inside the distro.

Install:

# enable wsl2 and install ubuntu

wsl --install

# inside the wsl2 ubuntu, install ollama

curl -fsSL https://ollama.com/install.sh | sh

From there it's identical to native Linux. Performance lands within 5% of bare-metal Linux on the same hardware.

What about AMD on Windows?

ROCm on Windows is getting better but it's still rough as of mid-2026. If you've got an AMD GPU on Windows, the path that actually works is LM Studio's Vulkan backend. No custom drivers needed. Slower than CUDA, but it runs.

The stuff that's the same everywhere

A few things don't care what OS you're on:

GGUF files are portable. A model you downloaded on a Mac runs on Linux runs on Windows.
Ollama's HTTP API is the same. A client on your Mac can talk to an Ollama server running on a Linux box across your home network, once that server is set to listen beyond localhost (OLLAMA_HOST=0.0.0.0).
LM Studio's UI is identical. Same buttons, same model browser.
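To make the cross-machine point concrete, here's a minimal sketch that asks an Ollama server which models it has. ollama_url and ollama_models are helper names invented for this post, and 192.168.1.50 is a placeholder address. Port 11434 and the /api/tags endpoint are Ollama's own. It assumes the server has been told to listen on the network. By default it only answers on localhost.

```shell
# Build the base URL for an Ollama server.
# Usage: ollama_url [host] [port]   (defaults: localhost, 11434)
ollama_url() {
    host=${1:-localhost}
    port=${2:-11434}
    echo "http://$host:$port"
}

# List the models a server has pulled (GET /api/tags returns JSON).
# Usage: ollama_models [host] [port]
ollama_models() {
    curl -fsS "$(ollama_url "$1" "$2")/api/tags"
}

# Example, from a Mac to a Linux box (placeholder address):
#   ollama_models 192.168.1.50
# The ollama CLI honours the same idea through an environment variable:
#   OLLAMA_HOST=192.168.1.50:11434 ollama list
```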

A 30-second sanity check

Before you go pulling models, confirm the install took:

# check ollama is installed

ollama --version

# pull and run the smallest useful model

ollama run llama3.2:1b "say hi"

Get a short greeting back and you're good. If the model takes more than 30 seconds to download or to produce its first token, the bottleneck is your network or your hardware, not the install.
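If the reply feels slow, a throughput number beats a stopwatch. Ollama's non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds), and dividing one by the other gives tokens per second. The helper name tok_per_sec is made up for this sketch.

```shell
# Usage: tok_per_sec <eval_count> <eval_duration_ns>
# Prints tokens per second with one decimal place.
tok_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns > 0) printf "%.1f\n", n / (ns / 1e9)
        else        print "0.0"     # avoid dividing by zero on an empty run
    }'
}

# Example with made-up numbers: 120 tokens in 4 seconds.
#   tok_per_sec 120 4000000000
```

If you don't want to touch the API at all, ollama run llama3.2:1b --verbose prints an eval rate after each reply.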

Where each OS trips people up

Mac, "the model is slow." Open Activity Monitor and choose Window > GPU History. If the GPU sits flat while the model generates, your runtime is on CPU. Running ollama ps also shows whether a loaded model is on CPU or GPU. Check the Ollama or LM Studio settings.
Linux, "no GPU detected." nvidia-smi should show your card. If it doesn't, the driver isn't loaded. Try lsmod | grep nvidia, and reboot after a driver install.
Windows native, "Ollama can't find the GPU." Update the NVIDIA driver and reboot. Then open Task Manager > Performance > GPU and confirm GPU usage climbs when a model runs.
WSL2, "everything's slow." Make sure you're on WSL2 and not WSL1: wsl --list --verbose. WSL1 has no GPU access at all.
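You can run the WSL check from inside the distro too, without going back to PowerShell. This sketch leans on a naming convention rather than an official API: stock WSL2 kernels carry "WSL2" in their release string, and WSL1 reports a string ending in "Microsoft". A custom kernel can break that, so treat the result as a hint. wsl_flavor is a name invented here.

```shell
# Usage: wsl_flavor <kernel_release>
wsl_flavor() {
    case "$1" in
        *WSL2*)          echo wsl2 ;;     # e.g. 5.15.153.1-microsoft-standard-WSL2
        *[Mm]icrosoft*)  echo wsl1 ;;     # e.g. 4.4.0-19041-Microsoft
        *)               echo native ;;   # plain Linux, or an unrecognised kernel
    esac
}

# Example inside a distro:
#   wsl_flavor "$(uname -r)"
```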

What's next

OS sorted, drivers verified. Next up is the per-tier hardware tour: exactly what runs comfortably on a 16GB MacBook Air, on a 24GB GPU, on an old laptop with integrated graphics. Concrete model picks for every common machine.


From the dictionary

Terms used in this post

Quick reference for the 8 terms you met above. Each one comes from the AI dictionary.

GGUF (ML): GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4 GB; the same model in Q8 is roughly 8 GB.
GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. NVIDIA dominates the market.
llama.cpp (AI): A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. For example, Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
LM Studio (AI): A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Ollama (AI): A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
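The GGUF sizes quoted above follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. Here's a back-of-the-envelope sketch. gguf_size_gb is a hypothetical helper, and real files come out somewhat larger than this estimate because quantized formats also store scaling data and metadata.

```shell
# Usage: gguf_size_gb <params_in_billions> <bits_per_weight>
# Prints an approximate file size in GB with one decimal place.
gguf_size_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Example: a 7B model at 4 bits per weight, then at 8.
#   gguf_size_gb 7 4
#   gguf_size_gb 7 8
```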
