Building Local: My 2026 Headless AI Server Journey

Hi everyone!

Just wanted to say it’s great to be here! I’m currently having a blast building and maintaining a local, headless AI server.

There’s something special about running models on your own hardware. I’m curious to know:

  • What are you all currently running on your local setups?

  • Any “hidden gem” projects or tools you think more people should know about in 2026?

Looking forward to catching up with the community!

Just use Gemma 4 31B with a smaller Gemma 4 as the draft model for speculative decoding
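For anyone who hasn’t tried it, here’s a minimal sketch of speculative (“assisted”) decoding via Hugging Face transformers. The repo IDs are placeholders, since I’m not sure of the exact Gemma 4 names on the Hub:

```python
# Minimal speculative ("assisted") decoding sketch with transformers.
# Both repo IDs below are placeholders; swap in the real Gemma 4 names.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")  # placeholder ID
target = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b-it", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("google/gemma-4-1b-it", device_map="auto")  # small drafter

inputs = tok("Why does speculative decoding speed up generation?", return_tensors="pt").to(target.device)
# assistant_model drafts cheap tokens; the big model verifies them in one pass
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

The draft model proposes several tokens per step and the 31B model only has to verify them, so you keep the big model’s output quality at a fraction of its decode cost.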

What are you all currently running on your local setups?

Hmm… I only use small embedding models day to day; I’ve integrated them into my work scripts. Since my GPU isn’t very powerful (a 3060 Ti with 8 GB of VRAM), I don’t often run very large models locally…
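For the curious, this is roughly the kind of thing I mean; the model name here is just a popular small default, not necessarily what I run:

```python
# Tiny everyday-embedding example; fits easily on an 8 GB card.
# "all-MiniLM-L6-v2" is just a common small model, used here for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80 MB of weights
vecs = model.encode([
    "query: where are the 2026 budget files?",
    "doc: budget_2026_final.xlsx lives in /finance",
])
print(vecs.shape)  # (2, 384) -> cosine-compare these in search/dedup scripts
```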

That said, I’ve heard that MoE LLMs in GGUF format on platforms like Ollama or LM Studio run smoothly even within just 32GB of system RAM (not VRAM).
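If you want to test that on Ollama, a CPU-only run is one option away in the Python client (the model tag below is a placeholder for whichever MoE GGUF you pull):

```python
# CPU-only MoE sketch via the Ollama Python client.
# "some-moe-model:latest" is a placeholder tag; pull a real MoE GGUF first.
import ollama

resp = ollama.chat(
    model="some-moe-model:latest",  # placeholder
    messages=[{"role": "user", "content": "Hello from system RAM!"}],
    options={"num_gpu": 0},  # 0 offloaded layers = everything stays in RAM
)
print(resp["message"]["content"])
```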

Personally, since most of my current use cases don’t require confidentiality, I just use cloud services for my LLMs.
Of course, I often try out models (LLM, T2I, etc.) hosted on HF via Spaces.

"You’re totally right to keep an eye on them—local models have come a long way! Even with a 3060 Ti, you can actually run some impressive stuff.

Because of how GGUF runtimes handle layer offloading now, you can ‘offload’ specific layers to your 8GB VRAM and let the rest spill over into your 32GB of system RAM. If you want to try it, I’d recommend starting with Gemma-4-E4B-it-Q8_0.gguf.

At that quantization, the model is about 7.5GB. If you set your context with `PARAMETER num_ctx 32768` in your Ollama Modelfile, it should fit comfortably across your VRAM and RAM. You’ll probably see speeds around 8–15 tokens/sec; not blazing fast, but the reasoning quality is excellent for a local setup.

If you want to go even bigger, you could technically run Gemma 4 26B (A4B). By putting as many layers as possible on the GPU and the rest in system RAM, you’d likely hit 6–8 tokens/sec. Even running entirely from system RAM, you’d still get about 3–4 tokens/sec. It’s definitely worth a shot if you want cloud-level smarts without the privacy concerns!"
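In the same spirit, here’s a minimal partial-offload sketch with llama-cpp-python, using the file suggested above (adjust the path and layer count to your hardware):

```python
# Partial GPU offload sketch with llama-cpp-python.
# Path matches the model suggested above; tune n_gpu_layers to your card.
from llama_cpp import Llama

llm = Llama(
    model_path="./Gemma-4-E4B-it-Q8_0.gguf",  # ~7.5 GB at Q8_0, per the post
    n_gpu_layers=-1,   # -1 = offload every layer that fits; lower it if you OOM
    n_ctx=32768,       # mirrors the `num_ctx 32768` setting mentioned above
)

out = llm("Summarize speculative decoding in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```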


Yeah. The Gemma 4 family is amazing. :laughing:
Even though it’s still so new that the backend support isn’t fully polished yet, the generated results are clearly better.
I was amazed by the multilingual performance back when Qwen 2.5 and Gemma 2 came out, too…

I occasionally test models on the Hub (via my Space), focusing mainly on small multilingual LLMs under 14B. It’s just random spot-checking. Models of the same size have kept getting better and better, and that has held true for years now… Rather than improving at a steady pace, every now and then an amazing model just pops up.

"I totally agree on the ‘pop up’ nature of these releases—it feels like we go months with small tweaks, and then a model like Gemma 4 just resets the baseline for what’s possible on consumer hardware.

I’m actually restructuring my whole setup to give these new models more room to run. I’m moving to a three-node system:

  1. AI Headless Server: My gaming PC (7800 XT 16GB / 5600 CPU) dedicated 100% to the LLM weights. No display out, no background apps—just raw VRAM for the model.

  2. Middleware Server: A Lenovo ThinkServer handling the ‘heavy lifting’ of the UI (Open WebUI), RAG/File processing (AnythingLLM), and the Cloudflare tunnel.

  3. Daily Driver: My main PC just for the GUI.

My goal is to get the Gemma 4 26B (A4B) running at its full potential. By keeping the ‘Admin’ tasks on the ThinkServer, I’m hoping to keep that 26B model snappy (aiming for 20 t/s) while retaining the intelligence of a much larger model. It really feels like we’re finally reaching the point where local ‘mid-range’ hardware can compete with the big cloud models."
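For the ‘Daily Driver’ side of a setup like this, the wiring can be as thin as pointing a client at the headless node. A rough sketch with the Ollama Python client (hostname and model tag are placeholders for this hypothetical cluster):

```python
# Daily-driver sketch: talk to the headless node over the LAN.
# Hostname and model tag are placeholders for this hypothetical setup.
from ollama import Client

client = Client(host="http://ai-headless.local:11434")  # the 7800 XT box
reply = client.chat(
    model="gemma4:26b-a4b-it-q4_K_M",  # placeholder tag
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(reply["message"]["content"])
```

In practice, Open WebUI on the middleware box would play this role, but a direct client like this is handy for benchmarking the headless node without the UI in the loop.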

Project Summary: Multi-Server Cluster Optimization

Timeline: April 24, 2026 (approx. 3 hours of tuning)


1. Infrastructure & UI Upgrade

  • Update: Migrated the Lenovo ThinkServer to Open WebUI v2.0 (v0.9.2).

  • Result: Enabled the “Thinking Mode” UI and moved to a high-concurrency async database, allowing the server to manage massive file libraries (2TB SSD) without lag.

2. Model Deployment: Gemma 4 26B

  • Model: Deployed Gemma 4 26B-A4B-it (Q4_K_M).

  • Advantage: Used Mixture-of-Experts (MoE) routing to get 26B-level reasoning while activating only 3.8B parameters per token, delivering high-end intelligence at mid-range speeds.
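As a back-of-envelope sanity check on why this quantization suits a 16GB card, assuming roughly 4.5 bits per weight for Q4_K_M (a common community estimate, not an exact spec):

```python
# Rough memory estimate for a 26B model at Q4_K_M.
# 4.5 bits/weight is an assumed average for Q4_K_M, not an exact figure.
params = 26e9
bits_per_weight = 4.5
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~14.6 GB, close to a 16 GB card
```

That leaves only a sliver for the KV cache, which is why the layer split in the next section matters.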

3. Hardware Tuning: “The 97% Sweet Spot”

  • GPU: AMD Radeon 7800 XT (16GB VRAM).

  • Config: Manually tuned to 27 GPU-offloaded layers with a 16,000-token context window.

  • Result: Achieved 97% VRAM utilization. This maximizes the GPU’s capacity, leaving just enough room for context growth while spilling only the final 3 layers to system RAM.
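Assuming the backend is Ollama, the same tuning can be expressed through its Python client (the model tag is a placeholder; num_gpu and num_ctx are standard Ollama options):

```python
# Sketch of the "97% sweet spot" settings via the Ollama Python client.
# Model tag is a placeholder; num_gpu/num_ctx are standard Ollama options.
import ollama

resp = ollama.chat(
    model="gemma4:26b-a4b-it-q4_K_M",  # placeholder tag
    messages=[{"role": "user", "content": "ping"}],
    options={"num_gpu": 27, "num_ctx": 16000},  # 27 GPU layers, 16k context
)
print(resp["message"]["content"])
```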


4. Benchmarks

| Metric | Performance |
| --- | --- |
| Generation Speed | 25.13 tokens/s (Instantaneous feel) |
| Prompt Processing | 283.03 tokens/s (Fast large-file reading) |
| Stability | 100% stable at 97% load (Headless mode) |