February 2026 - April 2026
Obtaining State-of-the-Art (SOTA) competitive performance with open-source LLMs on edge devices.
I've recently been tuning LLMs for the edge, with great speedups on smaller models. This setup is different. Does it run fast? No. But it does answer better.
On artificialanalysis.ai, there are leaderboards for aggregate intelligence across multiple tests such as Humanity's Last Exam. These tests are designed to measure reasoning, instruction following, tool use, and other LLM abilities. My tuned models rank in the "tiny" size class, while common chat sites like ChatGPT and Gemini land in the "large" class. That leaves a large intelligence gap between Halo, my edge device, and the services I've been wanting to replace.
Using the leaderboards, I can see performance scores for proprietary models and services.
Anyone can use GPT-4 Turbo on ChatGPT for free, as much as they want, with limited access to GPT-5.3 for those who make an account. For comparison's sake, here are the scores of the "target" models:
| Model | Score |
|---|---|
| Gemini 3.1 Pro | 57 |
| GPT-5.3 Codex | 54 |
| Claude Opus 4.6 | 53 |
| Gemini 3.0 Flash | 46 |
| GPT 5.4 Nano | 44 |
| GLM 4.7 | 42 |
| DeepSeek V3.2 | 42 |
| Claude 4.5 Haiku | 37 |
| Gemini 2.5 Pro | 35 |
| GPT-4o | 19 |
| GPT-4 Turbo | 14 |
And here are the scores of my models from Tuning LLMs for The Edge:
| Model | Score |
|---|---|
| Qwen3.5-0.8B | 11 |
| DeepSeek-R1-1.5B | 9 |
| Qwen3-1.7B | 8 |
| LFM2.5-1.2B-Thinking | 8 |
Qwen3.5 can compete with GPT-4 Turbo on both speed and intelligence, but my models are lacking in top-end intelligence.
The recent release of Qwen3.5 and Gemma 4 allows for more intelligence in the same small footprint.
| Model | Score |
|---|---|
| Qwen3.5-4B | 27 |
| Gemma4-E4B | 19 |
| Qwen3.5-2B | 16 |
| Qwen3.5-0.8B | 11 |
Qwen3.5's 2B-parameter counterpart scores 16, outscoring GPT-4 Turbo at the cost of nearly halving throughput (19.08 t/s generation vs. 31.65 t/s).
There are also more reasoning-distilled models available, such as Jackrong's Qwopus3.5 v3. Built on the previous reasoning version, it was trained on the chains of thought (CoT) of SOTA models such as GPT-4.5-Pro and Claude Opus 4.6. This model is likely slightly smarter than the original Qwen3.5-4B, but still a far cry from the scores shown at the top end.
By sacrificing some speed, I've found techniques to run models at the edge that are competitive with the best free AI services.
Chain-of-thought techniques from LLMs inspired DeepSeek's landmark R1 model and paper, recipe included. The model was trained with pure reinforcement learning: groups of candidate answers to the same prompt are sampled and scored against each other within the group. This removes the need for a separate critic model, reducing the overall compute required for training.
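The group trick can be sketched in a few lines. This is an illustrative rendition of the group-relative advantage at the heart of DeepSeek's GRPO recipe, not their actual training code; the reward values are made up:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sampled answer's reward
    against its own group's mean and std, so no separate critic
    (value) model is needed during training."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Four candidate answers to one prompt, graded by a verifier (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct answers get a positive advantage and wrong ones a negative advantage, purely relative to their groupmates.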
Mixture of Experts (MoE) is a technique, used in DeepSeek's architecture, that routes each token through a few internal "experts". These models have a larger total parameter count, but only a fraction of those parameters are activated at one time. The computer only has to work on the "active" parameters, while the rest of the parameters, or "experts", sit in RAM. This allows for smarter models with a smaller active footprint, at the cost of needing more RAM.
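A toy sketch of the routing idea in Python. The gate values, the expert count of 8, and top-k of 2 are made-up illustrative numbers, not any real model's router:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, k=2):
    """Pick the top-k experts for a token and renormalize their weights.
    The remaining experts' parameters never leave RAM for this token."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))

# 8 experts, only 2 active per token.
print(route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.7]))
```

Only the two winning experts' weights are multiplied through for this token; with 8 equally sized experts that's roughly a quarter of the FFN compute per token.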
Thankfully, I now have ~22GB of RAM to work with on my Halo node. Here are some of the competitors in the MoE space that fit on Halo:
| Model | Score | tg128 (t/s) |
|---|---|---|
| Qwen3.6-35B-A3B | 43 | 7.37 |
| Qwen3.5-35B-A3B | 37 | 7.48 |
| gpt-oss-20B | 24 | 8.27 |
| Qwen3-30B-A3B | 22 | 13.25 |
Quantization matters, and Unsloth provides a version of Qwen3.6 with an extremely low memory footprint that loses very little intelligence.
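Some back-of-the-envelope arithmetic shows why bits-per-weight is the whole ballgame here. The 35B parameter count is read off the model name, and the bpw figures are illustrative stand-ins for fp16, a typical ~Q4 GGUF, and an aggressive low-bit quant:

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Rough in-RAM size of the weights alone; KV cache and
    activations come on top of this."""
    return n_params * bits_per_weight / 8 / 1e9

n = 35e9  # total parameter count implied by the "35B" in the name
for bpw in (16, 4.5, 2.7):
    print(f"{bpw:>4} bpw -> {quantized_size_gb(n, bpw):.1f} GB")
```

At 16-bit the weights alone (~70 GB) would never fit in Halo's ~22 GB of RAM; a low-bpw quant is what makes a 35B-total MoE viable at all.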
The context the model uses (the KV cache) can also be quantized, which I've set to iq4_nl, a modern quantization format that packs as much information as possible into each piece of context.
The combination of these two allows for higher inference speeds and lower VRAM usage.
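As a concrete sketch, here is roughly what both knobs look like in a llama.cpp `llama-server` invocation. The GGUF filename is a placeholder, and exact flags and defaults vary across llama.cpp versions:

```shell
# Low-bpw quantized weights come from the GGUF file itself;
# --cache-type-k/-v quantize the KV cache on top of that.
llama-server \
  -m Qwen3.6-35B-A3B-low-bpw.gguf \
  --ctx-size 32768 \
  --cache-type-k iq4_nl \
  --cache-type-v iq4_nl
```

Note that quantizing the V cache has required flash attention (`-fa`) in some llama.cpp builds, so add that flag if the server complains.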
For chatbots and agents, the priorities are speed and long-context memory for tool use; you wouldn't want a model to die halfway through a conversation or while scraping 10 webpages. For this use case, I've selected Qwen3.5-2B for its speed and small memory footprint, as well as its tool-use and instruction-following capabilities.
For reasoning and high-level thinking, I've selected Qwen3.6-35B-A3B, for the same reasons as Qwen3.5-2B but bigger. Qwen3.5-35B's score of 37 on the leaderboard matches Claude 4.5 Haiku, which had released only 4 months earlier. Qwen3.6 bumps coding and other abilities to score 43, above GPT-5 Mini, DeepSeek V3.2, and GLM 4.7. I don't mind waiting a little longer for stronger reasoning, so this model is more for sending off bulk work and coming back to it later.
Using LLM benchmark leaderboards and comparing model performance on my hardware, I've found models that fit my use cases while staying competitive with free services that require an account.