Run AI Locally
A no-nonsense guide to running LLMs on your own hardware. Full privacy, zero API costs, works offline. Here's exactly what you need and how to set it up.
Who this is for
I've been running local models for over a year on everything from a budget Linux box to an M3 Max MacBook. This guide covers the practical side: what hardware you actually need, which software to use, and which models are worth downloading. If you're tired of API costs, privacy concerns, or rate limits, this is your starting point.
Why run locally
Privacy
Your data never leaves your machine. No third-party API logging, no training on your inputs. Critical for handling sensitive documents, client data, or classified material.
Cost
After the hardware investment, every inference is free. Run thousands of queries per day without watching a billing dashboard. Pays for itself within months for heavy users.
Speed
No network latency, no rate limits, no queue times. Local inference on a good GPU delivers responses in seconds. Perfect for batch processing and tooling integration.
Offline Access
Works on planes, in air-gapped environments, and during outages. Your AI assistant doesn't need the internet. Essential for field work and secure facilities.
Hardware requirements
You do not need a data center. As a rough rule of thumb, a Q4-quantized model needs a little over half a gigabyte of RAM or VRAM per billion parameters, plus headroom for context: an 8B model fits comfortably in 8GB, while 70B-class models want 40GB or more.
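The sizes in the model table below follow directly from this arithmetic. A minimal sketch (the 4.8 bits-per-weight figure approximates Q4_K_M; the fixed overhead allowance for context and buffers is my assumption):

```python
# Rough memory estimate for a quantized model.
# 4.8 bits/weight approximates Q4_K_M; overhead_gb is a
# coarse allowance for context and runtime buffers.
def estimate_gb(params_billion: float, bits_per_weight: float = 4.8,
                overhead_gb: float = 1.5) -> float:
    """Approximate RAM/VRAM needed to load a model, in gigabytes."""
    weights_gb = params_billion * bits_per_weight / 8  # bits -> bytes
    return round(weights_gb + overhead_gb, 1)

print(estimate_gb(8))   # Llama 3.1 8B  -> 6.3
print(estimate_gb(70))  # Llama 3.1 70B -> 43.5
```

These line up with the 4.7GB and 40GB download sizes in the table once you account for the runtime overhead on top of the file size.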
Software options
Several tools cover this space, each good at different things. If you are starting from scratch, go with Ollama: it handles model downloads, quantization variants, and serving through a single command-line tool.
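Once Ollama is running it also exposes an HTTP API on localhost:11434, which is what makes local tooling integration practical. A minimal Python client with no third-party dependencies, sketched against the `/api/generate` endpoint (the `generate` helper name is mine):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    # stream=False makes Ollama return one JSON object
    # instead of newline-delimited streaming chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires a running Ollama instance with llama3.1 pulled.
    print(generate("llama3.1", "Explain XSS in 3 sentences"))
```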
Model recommendations
Matched to use case. Sizes listed are for the quantized (Q4_K_M) variant you will actually download.
| Use Case | Model | Size | Why |
|---|---|---|---|
| General chat | Llama 3.1 8B | 4.7GB | Fast, capable, great instruction following |
| Code assistance | CodeLlama 34B | 19GB | Trained on code, good at generation and review |
| Security analysis | Mixtral 8x7B | 26GB | Strong reasoning, handles technical analysis well |
| Document Q&A | Llama 3.1 70B (Q4) | 40GB | Best open-source quality, needs good GPU |
| Quick tasks | Phi-3 Mini | 2.3GB | Tiny but surprisingly capable for simple tasks |
| Privacy-critical | Mistral 7B | 4.1GB | Good balance of size and capability, runs anywhere |
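The table can double as a lookup. A sketch of picking the largest model that fits your free RAM/VRAM, using the sizes listed above (the Ollama tag names and the 1.2x headroom factor are my assumptions; confirm exact tags before pulling):

```python
# Q4_K_M download sizes from the table above, best model first.
# Tag names are the common Ollama tags -- verify locally.
MODELS = [
    ("llama3.1:70b", 40.0),
    ("mixtral:8x7b", 26.0),
    ("codellama:34b", 19.0),
    ("llama3.1", 4.7),
    ("mistral", 4.1),
    ("phi3:mini", 2.3),
]

def pick_model(available_gb: float, headroom: float = 1.2) -> str:
    """Return the most capable model that fits with some headroom."""
    for name, size_gb in MODELS:
        if size_gb * headroom <= available_gb:
            return name
    return "phi3:mini"  # smallest fallback

print(pick_model(16))  # 16GB machine -> "llama3.1"
```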
Security considerations
Running locally removes third-party exposure, but it is not zero-risk. Pull models only from registries you trust, treat model output as untrusted input in any automated pipeline, and keep the API bound to localhost (Ollama's default) unless you deliberately expose and firewall it.
Quick start
From zero to chatting with a local LLM in under five minutes. Copy and paste these commands.
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Llama 3.1 8B, ~4.7GB download)
ollama pull llama3.1

# Start chatting
ollama run llama3.1

# Or use the API
curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Explain XSS in 3 sentences"}'
```

Want more hands-on guides?
I write about AI, security tooling, and practical infrastructure on this blog. Real setups, real numbers, no vendor pitches.
Read the blog