Towards SOTA on the Edge

February 2026 - April 2026

Obtaining State-of-the-Art (SOTA) competitive performance with open-source LLMs on edge devices.

I've recently been tuning LLMs for the edge, getting great speedups out of smaller models. This time the trade goes the other way. Does it run fast? No. But it does answer better.


Background

On artificialanalysis.ai, there are leaderboards for aggregate intelligence across multiple tests such as Humanity's Last Exam. These benchmarks measure reasoning, instruction following, tool use, and other LLM abilities. My tuned models rank in the "tiny" size class, while common chat services like ChatGPT and Gemini land in the "large" class. That leaves a large intelligence gap between Halo, my edge device, and the services I've been wanting to replace.

The Target

Using the leaderboards, I can see performance scores for proprietary models and services.

Anyone can use GPT-4 Turbo on ChatGPT for free as much as they want, with limited access to GPT-5.3 after making an account. For comparison's sake, here are the scores of the "target" models:

Model Score
Gemini 3.1 Pro 57
GPT-5.3 Codex 54
Claude Opus 4.6 53
Gemini 3.0 Flash 46
GPT-5.4 Nano 44
GLM 4.7 42
DeepSeek V3.2 42
Claude 4.5 Haiku 37
Gemini 2.5 Pro 35
GPT-4o 19
GPT-4 Turbo 14

And here are the scores of my models from Tuning LLMs for The Edge:

Model Score
Qwen3.5-0.8B 11
DeepSeek-R1-1.5B 9
Qwen3-1.7B 8
LFM2.5-1.2B-Thinking 8

Qwen3.5 can compete with GPT-4 Turbo on both speed and intelligence, but my models are lacking in top-end intelligence.

Tiny Challengers

The recent releases of Qwen3.5 and Gemma 4 allow for more intelligence in the same small footprint.

Model Score
Qwen3.5-4B 27
Gemma4-E4B 19
Qwen3.5-2B 16
Qwen3.5-0.8B 11

Qwen3.5's 2B-parameter counterpart scores 16, outscoring GPT-4 Turbo at the cost of generation throughput dropping from 31.65 t/s to 19.08 t/s, roughly a 40% hit.

Distillations

There are also reasoning-distilled models available, such as Jackrong's Qwopus3.5 v3. Built on the previous reasoning version, it was trained on the chains of thought (CoT) of SOTA models such as GPT-4.5-Pro and Claude Opus 4.6. This model is likely slightly smarter than the original Qwen3.5-4B, but still a far cry from the scores at the top end.


Making the leap to SOTA

By sacrificing some speed, I've found techniques for running models at the edge that are competitive with the best free AI services.

Chain of Thought

Chain-of-Thought prompting techniques for LLMs inspired DeepSeek's landmark R1 model and paper, recipe included. The model was post-trained with pure reinforcement learning (GRPO), sampling groups of candidate answers for each prompt and scoring them against each other instead of training a separate critic model. This reduces the overall compute required for training, and the resulting reasoning behavior can be distilled into smaller models, reducing the compute required to run them.
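
To make the "groups competing" idea concrete, here's a minimal sketch of how a GRPO-style group-relative advantage is computed. The reward values and group size are made up for illustration; this isn't DeepSeek's actual training code.

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative only).
# A group of answers is sampled for the same prompt, each gets a scalar reward,
# and each answer's advantage is its reward relative to its own group -- no
# separate critic/value model is needed.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when rewards are identical
    return [(r - mu) / sigma for r in rewards]

# Hypothetical rewards for 4 sampled answers to one math prompt
# (e.g., 1.0 if the final answer is correct, plus a small format bonus).
rewards = [1.1, 0.0, 1.0, 0.1]
print(group_relative_advantages(rewards))
```

Answers that beat their own group's average get a positive advantage and are reinforced, which is what lets the recipe skip a separate value model entirely.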

Mixture of Experts

Mixture of Experts (MoE) is a technique, used in DeepSeek's architecture among others, that splits a model into multiple internal "experts" that answer a prompt together. These models have a larger total parameter count, but a router activates only a fraction of those parameters for each token. The computer only has to compute over the "active" parameters, while the rest of the experts sit in RAM. This allows for smarter models with a smaller active footprint, with the downside of needing more RAM.
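
To picture what "only a fraction active" means, here's a toy top-k routing sketch. The expert count, hidden size, and top-k are hypothetical toy values, nothing like the real Qwen or DeepSeek configurations.

```python
# Toy sketch of MoE top-k routing (hypothetical sizes, not a real architecture).
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16          # e.g. 8 experts, 2 active per token
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through only its top-k experts."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    # Only top_k of n_experts weight matrices are touched: the "active" parameters.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,) -- same output shape, 2/8 of expert weights used
```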

Thankfully, I now have ~22GB of RAM to work with on my Halo node. Here are some of the competitors in the MoE space that fit on Halo:

Model Score tg128 (t/s)
Qwen3.6-35B-A3B 43 7.37
Qwen3.5-35B-A3B 37 7.48
gpt-oss-20B 24 8.27
Qwen3-30B-A3B 22 13.25

Picking the Right Quant

Quantization matters, and Unsloth has provided a version of Qwen3.6 with an extremely low memory footprint that doesn't lose out on much intelligence. The model's context (the KV cache) can also be quantized; I've set it to iq4_nl, a modern 4-bit format that fits as much information as possible into each piece of context memory. The combination of the two allows for higher inference speeds and lower VRAM usage.
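
To see why both knobs matter on a ~22GB budget, here's some back-of-the-envelope memory math. The layer and head counts and the bits-per-weight below are illustrative assumptions, not the actual Qwen3.6 configuration.

```python
# Back-of-the-envelope memory math for a quantized MoE model plus a quantized
# KV cache. Architecture numbers below are illustrative assumptions.

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

# Weights: total parameters * bits per weight (an aggressive low-bit GGUF quant).
total_params    = 35e9     # 35B total parameters
bits_per_weight = 4.5      # assumed average for a low-bit quant
weight_gib = gib(total_params * bits_per_weight / 8)

# KV cache: 2 (K and V) * layers * KV heads * head dim * context * bytes/element.
n_layers, n_kv_heads, head_dim = 48, 8, 128   # assumed architecture values
context_len = 32_768
bytes_f16 = 2.0
bytes_iq4 = 0.5625   # iq4_nl is ~4.5 bits per element
kv_f16 = gib(2 * n_layers * n_kv_heads * head_dim * context_len * bytes_f16)
kv_iq4 = gib(2 * n_layers * n_kv_heads * head_dim * context_len * bytes_iq4)

print(f"weights ~{weight_gib:.1f} GiB, f16 KV cache ~{kv_f16:.1f} GiB, "
      f"4-bit KV cache ~{kv_iq4:.1f} GiB")
```

With numbers in that ballpark, shaving both the weights and the cache is what lets a 35B-total MoE model fit in ~22GB at a usable context length.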


Use Cases

Chatbot / Agent

For chatbots and agents, the priorities are speed and long-context memory for tool use. For example, you wouldn't want a model to fall over halfway through a conversation or while scraping 10 webpages. For this use case, I've selected Qwen3.5-2B for its speed and small memory footprint, as well as its tool-usage and instruction-following capabilities.
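
Assuming the model is served behind an OpenAI-compatible endpoint (llama.cpp's built-in server provides one), a tool-calling request looks roughly like the sketch below. The base URL, model name, and the weather tool are placeholders, not what's actually running on Halo.

```python
# Minimal tool-calling sketch against a local OpenAI-compatible endpoint.
# The base_url, model name, and the weather tool are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://halo.local:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-2b",   # the small, fast chat/agent model
    messages=[{"role": "user", "content": "Do I need an umbrella in Toronto?"}],
    tools=tools,
)

# If the model follows instructions well, it asks for the tool instead of guessing.
print(resp.choices[0].message.tool_calls)
```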

Reasoning

For reasoning and high-level thinking, I've selected Qwen3.6-35B-A3B for the same reasons as Qwen3.5-2B, just bigger. Qwen3.5-35B's earlier score of 37 on the leaderboard is equivalent to Claude 4.5 Haiku, which had been released only 4 months before. Qwen3.6 brings a bump in coding and other abilities to score 43, above GPT-5 Mini, DeepSeek V3.2, and GLM 4.7. I don't mind waiting a little longer for stronger reasoning, so this model is more for sending off bulk work and coming back to it later.
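
That "send bulk work, come back later" pattern is simple enough to script. Here's a rough sketch under the same OpenAI-compatible-endpoint assumption; the prompts, output file, and model identifier are placeholders.

```python
# Sketch of queueing slow reasoning jobs for the big MoE model and collecting
# results later. Endpoint, model name, and prompts are placeholder assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://halo.local:8080/v1", api_key="none")

jobs = [
    "Review this module for race conditions: ...",
    "Draft a migration plan from SQLite to Postgres for the logging service.",
]

with open("overnight_results.jsonl", "w") as out:
    for prompt in jobs:
        resp = client.chat.completions.create(
            model="qwen3.6-35b-a3b",   # slow but smart; runs unattended
            messages=[{"role": "user", "content": prompt}],
        )
        out.write(json.dumps({"prompt": prompt,
                              "answer": resp.choices[0].message.content}) + "\n")
```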


Conclusion

Using LLM benchmark leaderboards and measuring model performance on my own hardware, I've found models that fit my use cases while staying competitive with free services that require an account.