Cloud-based coding assistants are excellent—until your internet drops mid-session, your company bans sending proprietary code to third-party APIs, or you get hit with an unexpected $400 bill because your agent burned through tokens overnight. That is when a local LLM for coding stops being a hobby project and starts being a necessity.
I have spent the last three months running seven open-weight coding models on everything from a laptop with 8GB of VRAM to a dual-GPU workstation. This is not a benchmark regurgitation exercise. I wrote real code with these models—FastAPI backends, React components, Bash scripts, database migrations—and tracked where each one excelled and where it fell apart.
Here is what I found.
Disclosure: This article contains affiliate links. If you purchase through our links, we may earn a commission at no extra cost to you. We only recommend tools we have tested and trust.
TL;DR — Quick Pick Table
| Rank | Model | Best For | Min VRAM | HumanEval |
|---|---|---|---|---|
| #1 | Qwen3-Coder-Next | Agentic coding, best overall | 24 GB | 94.1% |
| #2 | DeepSeek-V3.2 | Algorithmic/math-heavy code | 24 GB | 93.4% |
| #3 | Qwen 2.5 Coder 32B | All-around workhorse | 20 GB | 92.7% |
| #4 | Devstral Small 2 | SWE-bench tasks, real-world bugs | 16 GB | ~89% |
| #5 | Codestral 25.12 | IDE autocomplete / FIM | 16 GB | 89.7% |
| #6 | DeepSeek-Coder V3 Distilled | Budget GPU, 338 languages | 12 GB | 87.2% |
| #7 | StarCoder2 15B | Laptops, fine-tuning base | 8 GB | ~73% |
No 24GB GPU? Rent one on RunPod from $0.39/hr and run any model on this list in minutes.
Why Run a Local LLM for Coding?
Before we get into the rankings, let me address the obvious question: why would you run a model locally when Claude, GPT, and Gemini exist?
Four reasons keep coming up in every conversation I have with developers who have made the switch:
1. Privacy and IP Protection
If you work at a company with strict data governance—finance, healthcare, defense, or frankly any organization that takes its intellectual property seriously—sending source code to an external API is often a non-starter. A local model processes everything on your hardware. Your code never leaves your network.
2. Cost at Scale
API-based coding assistants charge per token. One aggressive agentic session can burn through $5-15 in API costs. If you are running coding agents across a team of 20 developers, those costs compound fast. A local model running on hardware you already own costs nothing per token beyond electricity.
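To make the economics concrete, here is a back-of-envelope break-even sketch. The per-session cost comes from the range above; the GPU price, session frequency, and power estimate are illustrative assumptions, not quotes:

```python
# Back-of-envelope break-even for local vs. API coding assistants.
# All inputs are illustrative assumptions, not measured prices.

def monthly_api_cost(sessions_per_day, cost_per_session, workdays=22):
    """Estimated monthly spend on an API-based coding agent."""
    return sessions_per_day * cost_per_session * workdays

def months_to_break_even(gpu_price, monthly_api, monthly_power=15):
    """Months until a one-time GPU purchase beats recurring API fees."""
    savings = monthly_api - monthly_power  # local still pays for electricity
    return gpu_price / savings

api = monthly_api_cost(sessions_per_day=2, cost_per_session=10)
print(f"API spend: ${api:.0f}/mo")  # $440/mo at these assumptions
print(f"Break-even on a $1,600 GPU: {months_to_break_even(1600, api):.1f} months")
```

At even moderate agent usage, a 24GB card pays for itself within a few months; lighter users may never break even, which is worth checking before buying hardware.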
3. Offline Access
Airport, train, cabin in the mountains, or just your ISP having a bad day. Local models work regardless. I have written entire features on a flight from Munich to San Francisco with Qwen 2.5 Coder running on my MacBook. No internet required.
4. Customization and Fine-Tuning
Local models can be fine-tuned on your codebase, your conventions, your API patterns. No cloud provider offers this level of customization. If your team has specific coding standards or works with niche frameworks, a fine-tuned local model will outperform a generic cloud model every time.
Hardware Requirements: What GPU Do You Actually Need?
This is where most guides get lazy and just say “more VRAM is better.” That is true but useless. Here is a concrete breakdown of what you can run at each hardware tier, based on Q4_K_M quantization (the sweet spot between quality and speed):
8 GB VRAM (RTX 3060, RTX 4060, M1/M2 MacBook)
- What fits: StarCoder2 15B, Qwen 2.5 Coder 7B, DeepSeek-Coder 6.7B
- Speed: 25-40 tokens/sec
- Reality check: Good for autocomplete, basic code generation. Struggles with complex multi-file refactoring.
16 GB VRAM (RTX 4070 Ti, RTX 4080, M2 Pro/Max)
- What fits: Codestral 25.12, Devstral Small 2, Qwen 2.5 Coder 14B
- Speed: 20-35 tokens/sec
- Reality check: The sweet spot for solo developers. Codestral at this tier is genuinely excellent for IDE integration.
24 GB VRAM (RTX 3090, RTX 4090, M3 Max 36GB)
- What fits: Qwen 2.5 Coder 32B, Qwen3-Coder-Next (tight), DeepSeek-V3.2 (tight)
- Speed: 15-30 tokens/sec depending on model
- Reality check: This is where local models start matching cloud API quality. Qwen 2.5 Coder 32B on a 4090 is the experience that converts skeptics.
48 GB+ VRAM (Dual GPU, A6000, Mac Studio M4 Ultra 192GB)
- What fits: Everything, including full-precision Qwen3-Coder-Next and DeepSeek-V3.2
- Speed: 10-20 tokens/sec for largest models
- Reality check: Overkill for most individual developers. Enterprise territory or serious enthusiasts.
My recommendation: If you are buying a GPU specifically for local coding LLMs, get a 24GB card. The RTX 4090 or a used RTX 3090 gives you access to the top-tier models. Anything less and you are making meaningful quality compromises.
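If you want to sanity-check whether a given model fits your card, a rough rule of thumb gets you close: Q4_K_M averages roughly 4.8-4.9 bits per weight, plus a couple of gigabytes of headroom for the KV cache and runtime. Both constants below are approximations, and aggressive quants or long contexts will shift the result:

```python
# Rough VRAM estimate for a Q4-quantized model: weights + runtime headroom.
# 4.85 bits/weight and the flat 2 GB overhead are approximations; actual
# usage depends on the runtime, quant variant, and context length.

def est_vram_gb(params_billion, bits_per_weight=4.85, overhead_gb=2.0):
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, b in [("Qwen 2.5 Coder 32B", 32), ("Devstral Small 2", 24)]:
    print(f"{name}: ~{est_vram_gb(b):.1f} GB")
```

The estimates land near the minimum-VRAM figures in the table above for the mid-size dense models; smaller models at more aggressive quantization can squeeze below what this formula predicts.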
The 7 Best Local LLMs for Coding in 2026
#1. Qwen3-Coder-Next — The New King
Parameters: 80B total (3B active via MoE) | Min VRAM: 24 GB (Q4) | Context: 256K | License: Apache 2.0
Qwen3-Coder-Next from Alibaba’s Qwen team is the best local coding LLM available today, full stop. Released in February 2026, it hit #1 on SWE-bench Pass@5 with 64.6%—beating every closed model including Claude Opus 4.6 and GPT-5.2 on that specific metric. Its HumanEval score of 94.1% puts it in rarefied air for an open-weight model.
The secret is its Mixture-of-Experts architecture: 80 billion total parameters, but only 3 billion activate per inference step. This means it reasons like a massive model but runs at a speed that is actually usable for interactive coding. On my RTX 4090, I get roughly 18-22 tokens per second at Q4 quantization, which is fast enough for real-time coding assistance.
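For readers unfamiliar with MoE, here is a toy sketch of the idea: a router scores all experts, but only the top-k actually run, so compute scales with k rather than with the total expert count. This is purely illustrative, not Qwen's actual router or architecture:

```python
# Toy mixture-of-experts forward pass. Only the k highest-scoring
# experts execute, which is why an 80B-total model can do roughly the
# work of a 3B-active model per step. Real MoE layers route per token
# inside a transformer with learned gates; this is just the shape of it.

def moe_forward(x, experts, router_scores, k=2):
    # Select the k experts with the highest router scores.
    top = sorted(range(len(experts)), key=lambda i: router_scores[i],
                 reverse=True)[:k]
    total = sum(router_scores[i] for i in top)
    # Weighted combination of just the selected experts' outputs.
    return sum(router_scores[i] / total * experts[i](x) for i in top)

experts = [lambda x, m=m: m * x for m in (1, 2, 3, 4)]  # 4 tiny "experts"
y = moe_forward(10.0, experts, router_scores=[0.1, 0.5, 0.1, 0.3], k=2)
print(y)  # 27.5 — only experts 1 and 3 ran; the other two cost nothing
```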
Where it truly shines is agentic coding. It was trained specifically for long-horizon tasks—planning a multi-step refactor, executing it across files, running tests, and recovering from errors. I pointed it at a 400-line Django view that needed to be split into separate service and serializer layers. It planned the decomposition, created the new files, updated imports across the project, and fixed the two test failures that resulted. All locally.
Weakness: At Q4 quantization on 24GB, context starts getting tight with very large codebases. You may need to be strategic about what context you feed it.
#2. DeepSeek-V3.2 — The Algorithm Specialist
Parameters: 671B total (37B active) | Min VRAM: 24 GB (Q4) | Context: 160K | License: MIT
DeepSeek’s V3.2 release remains the strongest open-weight model for algorithmic and math-heavy code. Its 93.4% HumanEval score is nearly identical to Qwen3-Coder-Next, and on LiveBench Coding Average it scores 75.69—the highest among open-source contenders.
I found DeepSeek-V3.2 consistently better than Qwen3-Coder for three specific tasks: implementing data structures from scratch, writing numerical/scientific Python, and optimizing existing algorithms. When I asked both models to implement a concurrent B-tree with lock-free reads, DeepSeek produced a correct implementation on the first try. Qwen needed two iterations.
The downside is hardware requirements. At 671B total parameters (37B active), it can technically run with 24GB of VRAM at aggressive quantization by offloading inactive experts to system RAM, but it is a tight squeeze and context length suffers. For comfortable use, you want 48GB or more. Most individual developers will be better served by Qwen3-Coder-Next unless algorithmic work is their primary use case.
Weakness: Massive total parameter count means slow model loading and enormous disk usage (on the order of 350GB for Q4 of the full 671B weights). Not practical for quick startup-and-stop sessions.
#3. Qwen 2.5 Coder 32B — The Reliable Workhorse
Parameters: 32B dense | Min VRAM: 20 GB (Q4) | Context: 128K | License: Apache 2.0
If Qwen3-Coder-Next is the shiny new thing, Qwen 2.5 Coder 32B is the model that has been quietly earning trust for the past eight months. It scores 92.7% on HumanEval, handles 128K context, and has the broadest community support of any local coding model—the most Ollama downloads, the most fine-tuned variants, the most tutorials.
This is the model I recommend to developers who ask “which local LLM should I try first?” It has no weird quirks, no surprising failure modes, and generates consistently solid code across Python, TypeScript, Go, Rust, and Java. It is also a dense model (not MoE), which means inference is more predictable and quantization is more straightforward.
At Q4_K_M on a 24GB GPU, you get roughly 20-25 tokens per second with plenty of headroom for context. It runs well on a Mac with 32GB of unified memory too. For 90% of day-to-day coding tasks, the quality gap between this and the top-ranked models is negligible.
Weakness: Its SWE-bench scores lag behind the top two. For complex multi-file agentic tasks, Qwen3-Coder-Next is measurably better.
#4. Devstral Small 2 — The SWE-bench Champion at Its Size
Parameters: 24B dense | Min VRAM: 16 GB (Q4) | Context: 256K | License: Apache 2.0
Mistral’s Devstral Small 2 is a 24B-parameter model that punches absurdly above its weight. It scores 68.0% on SWE-bench Verified—the strongest result for any open-weight model under 30B parameters, outscoring many 70B-class competitors. It also supports image inputs for multimodal agent workflows.
The 256K context window is a standout feature at this parameter count. Combined with its 16GB VRAM requirement at Q4, this is the model that makes local coding accessible on mid-range hardware. An RTX 4070 Ti or a MacBook Pro with 16GB can run it comfortably.
In my testing, Devstral Small 2 was the best model for understanding and fixing bugs in existing codebases. Point it at a failing test, give it the relevant source files, and it will correctly diagnose and fix the issue more often than models twice its size. Mistral clearly optimized for the “real developer workflow” rather than synthetic benchmarks.
Weakness: Raw code generation quality (HumanEval) is a step below the top three. If you need the model to write large amounts of new code from scratch, Qwen 2.5 Coder 32B is stronger.
#5. Codestral 25.12 — The IDE Autocomplete King
Parameters: 22B dense | Min VRAM: 16 GB (Q4) | Context: 64K | License: Mistral AI Non-Production License
If your primary use case is fast, accurate autocomplete inside your IDE—not agentic coding, not chat-based generation, just fill-in-the-middle (FIM) completions as you type—Codestral 25.12 is the best local model for the job. It was specifically trained on FIM tasks and it shows: completions are contextually aware, fast, and rarely produce the runaway generations that plague other models in autocomplete mode.
Codestral 25.12 supports 80+ programming languages and integrates directly with VS Code, Cursor, and JetBrains IDEs through Continue.dev or similar extensions. The 22B parameter count keeps inference snappy—I consistently got 30+ tokens per second on 16GB VRAM, which feels instant for autocomplete.
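For context, FIM works by wrapping the code before and after your cursor in sentinel tokens so the model generates only the missing middle. The tokens below are placeholders, not Codestral's actual vocabulary; each model family defines its own, and IDE extensions or the runtime usually insert them for you:

```python
# Sketch of how a fill-in-the-middle (FIM) prompt is assembled.
# The sentinel tokens are hypothetical placeholders; check the model
# card for the real ones, as every model family uses different tokens.

def build_fim_prompt(prefix, suffix,
                     pre="<fim_prefix>", suf="<fim_suffix>",
                     mid="<fim_middle>"):
    """Wrap the code around the cursor so the model fills the hole."""
    return f"{pre}{prefix}{suf}{suffix}{mid}"

prompt = build_fim_prompt(
    prefix="def is_even(n: int) -> bool:\n    return ",
    suffix="\n\nprint(is_even(4))",
)
# The model generates only the middle (e.g. "n % 2 == 0") after the
# final sentinel, which is what makes autocomplete fast and contained.
print(prompt)
```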
Weakness: The non-production license is a deal-breaker for some enterprises. Codestral is free for development and testing, but commercial deployment requires a paid agreement with Mistral. Also, its 64K context window is the smallest on this list.
#6. DeepSeek-Coder V3 Distilled 16B — The Budget Pick
Parameters: 16B | Min VRAM: 12 GB (Q4) | Context: 128K | License: MIT
DeepSeek-Coder V3 Distilled is the model to run when you have limited hardware but still want meaningful coding assistance. At just 12GB VRAM, it fits on an RTX 3060 12GB or an older RTX 3080—cards you can find used for under $250.
Its 87.2% HumanEval score is solid for its size, and it supports an impressive 338 programming languages. If you work with niche languages like Haskell, Elixir, or Zig, this model has better coverage than alternatives in its weight class. The MIT license means there are zero restrictions on commercial use.
In practice, I found it genuinely useful for boilerplate generation, writing tests, and explaining existing code. Where it falls short is multi-step reasoning—ask it to plan and execute a complex refactor and it loses the thread. For that, you need a larger model.
Weakness: Cannot handle complex agentic workflows. SWE-bench score (40.5%) confirms it struggles with real-world multi-file tasks.
#7. StarCoder2 15B — The Fine-Tuning Foundation
Parameters: 15B | Min VRAM: 8 GB (Q4) | Context: 16K | License: BigCode Open RAIL-M
StarCoder2 is the oldest model on this list, and its benchmark scores show it. At roughly 73% on HumanEval, it is outclassed by every other model here for raw code generation. So why include it?
Two reasons. First, it runs on 8GB of VRAM. If you have a laptop with an RTX 3050 or a base M1 MacBook, StarCoder2 is your best option for local code completion. Second, it is the most popular base model for fine-tuning. The BigCode training pipeline is well-documented, the community has produced hundreds of specialized variants, and if you want to train a model on your company’s internal codebase, StarCoder2 is the most practical starting point.
Weakness: Benchmark scores are a full generation behind the leaders. The 16K context window is limiting. Use it as a fine-tuning base or a laptop fallback, not as your primary coding model.
How to Set Up: Ollama Quickstart (5 Commands)
The easiest way to run any of these models locally is Ollama. Here is the complete setup:
```bash
# 1. Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a coding model (pick one based on your VRAM)
ollama pull qwen2.5-coder:32b   # 24GB GPU — best all-around
ollama pull devstral-small-2    # 16GB GPU — strong SWE-bench
ollama pull qwen2.5-coder:7b    # 8GB GPU — laptop-friendly

# 3. Start coding interactively
ollama run qwen2.5-coder:32b "Write a FastAPI endpoint that accepts a JSON payload, validates it with Pydantic, and stores it in PostgreSQL"

# 4. Or use as a local API (compatible with the OpenAI SDK)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder:32b", "messages": [{"role": "user", "content": "Refactor this function to use async/await"}]}'

# 5. Connect to your IDE via Continue.dev or similar
# In VS Code: install Continue extension → set model to localhost:11434
```
That is it. Five commands and you have a private, local coding assistant. No API keys, no usage limits, no data leaving your machine.
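Because the local endpoint speaks the OpenAI chat format, you can also call it from Python with nothing but the standard library. A minimal sketch, assuming Ollama is running on its default port with the model pulled; the `build_payload` and `chat` helper names are ours, not part of Ollama:

```python
# Minimal stdlib client for Ollama's OpenAI-compatible endpoint.
# Assumes a local Ollama instance on the default port 11434.
import json
import urllib.request

def build_payload(prompt, model="qwen2.5-coder:32b"):
    """OpenAI-style chat completion body accepted by the /v1 endpoint."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, url="http://localhost:11434/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Usage (with Ollama running):
#   print(chat("Refactor this function to use async/await"))
```

Because the wire format matches OpenAI's, the same payload works with any tool or SDK that lets you override the base URL.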
Alternative runners: LM Studio offers a GUI-based experience if you prefer point-and-click. For production serving with batching and multi-GPU support, vLLM is the standard. And text-generation-webui provides a full web interface with extensive model management features.
Benchmark Comparison Table
Numbers matter, but context matters more. HumanEval tests isolated function generation. SWE-bench tests real-world bug fixing across full repositories. Speed was measured on an RTX 4090 at Q4_K_M quantization.
| Model | HumanEval | SWE-bench Verified | Speed (tok/s) | Context | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | 94.1% | 70.6% | ~20 | 256K | 24 GB |
| DeepSeek-V3.2 | 93.4% | 56.1% | ~15 | 160K | 24 GB |
| Qwen 2.5 Coder 32B | 92.7% | ~48% | ~22 | 128K | 20 GB |
| Devstral Small 2 | ~89% | 68.0% | ~28 | 256K | 16 GB |
| Codestral 25.12 | 89.7% | 42.0% | ~32 | 64K | 16 GB |
| DeepSeek-Coder V3 Dist. | 87.2% | 40.5% | ~35 | 128K | 12 GB |
| StarCoder2 15B | ~73% | N/A | ~40 | 16K | 8 GB |
Note: SWE-bench Verified scores are from official reports where available. Speed varies significantly based on context length and quantization method. Your mileage will vary.
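To translate the speed column into felt latency, divide completion length by throughput; a token is roughly three to four characters of code. A quick illustration using two speeds from the table above:

```python
# Wall-clock time to generate a completion at a given throughput.
# Token counts and speeds here are illustrative, taken from the
# benchmark table's approximate figures.

def generation_seconds(tokens, tok_per_sec):
    return tokens / tok_per_sec

# A ~300-token function body at two representative speeds:
for model, speed in [("Qwen3-Coder-Next", 20), ("Codestral 25.12", 32)]:
    print(f"{model}: {generation_seconds(300, speed):.0f}s for 300 tokens")
```

This is why 15 tok/s feels fine for chat-style generation but sluggish for autocomplete, where you want completions in well under a second.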
When Local Is Not Enough: Cloud GPU Alternative
Let me be honest about the limitations. There are scenarios where running locally does not work:
- You do not have a 24GB GPU and you want to run top-tier models without compromise.
- You need to serve a team of developers from a single model instance with batched inference.
- You want to test multiple models quickly without downloading 20-40GB per model.
- You need the absolute largest models (DeepSeek-V3.2 at full precision, for example) that require 80GB+ of VRAM.
For all of these, cloud GPUs are the practical answer. RunPod is what I use when I need to test models that do not fit on my local hardware. You can rent an A100 80GB for $1.64/hr or an RTX 4090 for $0.39/hr, spin up an Ollama instance, run your tests, and shut it down. No commitment, no long-term contracts.
The workflow is straightforward:
```bash
# On RunPod: deploy an Ollama template pod with an A100 80GB,
# then SSH in and pull whatever model you want:
ollama pull deepseek-v3.2
ollama pull qwen3-coder-next

# Use the RunPod proxy URL as your API endpoint —
# works with any tool that supports the OpenAI API format
```
DigitalOcean also offers GPU droplets for model hosting if you prefer a more traditional cloud provider with predictable monthly billing. Their GPU instances start at $2.99/hr for an H100 and include managed Kubernetes for production deployments.
Privacy-conscious developers downloading models from Hugging Face behind restricted corporate networks may also want to consider a VPN—NordVPN is a solid option for keeping your download activity private, especially when pulling models from international repositories.
The Bottom Line
The best local LLM for coding in 2026 depends entirely on your hardware:
- 24GB GPU: Run Qwen3-Coder-Next. It is the best open-weight coding model available. Period.
- 16GB GPU: Run Devstral Small 2 for agentic tasks or Codestral 25.12 for IDE autocomplete.
- 12GB GPU: Run DeepSeek-Coder V3 Distilled. Best quality at this VRAM tier.
- 8GB GPU: Run Qwen 2.5 Coder 7B or StarCoder2. Functional, but you will feel the limitations.
- No GPU: Rent one on RunPod. An RTX 4090 at $0.39/hr is cheaper than most API plans if you use it a few hours per week.
The gap between local and cloud models has collapsed. Qwen3-Coder-Next running on a consumer GPU genuinely rivals the best proprietary models for coding tasks. The main trade-off is no longer quality—it is convenience. Cloud models are easier to set up and always available. Local models are private, free, and yours to customize.
For most developers reading this, the right answer is both: a local model for day-to-day private coding, and a cloud API for the occasional task that needs maximum power or context.
This article was last updated April 2026. Model benchmarks and availability change frequently—we will update this guide as new models are released.