They said unquantized local AI was impossible on budget phones. We got a 2.3GB FP32 model running locally on a €120 Galaxy A25 CPU. No GPU, no NPU, uses less RAM than Chrome

The current meta in local AI is that you have to quantize. Big Tech is telling us that to run anything on the edge, we need to compress 2B+ models down to 4-bit, sacrifice the signal-to-noise ratio, and rely on flagship NPUs or Apple Silicon just to survive the memory bandwidth bottleneck.

We at Open Machine didn’t buy it. So, we built a 245M parameter model from scratch, kept it in raw uncompressed 32-bit float (FP32), and ran it on the absolute worst hardware we could find: a two-year-old plastic Samsung Galaxy A25.

Attached is the raw screen recording. Airplane mode on.

The Specs:

  • Model: Open Machine 245M (Trained from scratch on 20B tokens)

  • Weights: 2.3GB pure FP32 (ONNX export)

  • Hardware: €120 Samsung A25 (Exynos 1280)

  • Compute: CPU ONLY. GPU is off. NPU is off.

  • RAM: ~4.4GB used (literally lighter than opening a few Chrome tabs; see the snippet right after this list if you want to check it yourself).

  • Thermals: 33.3°C. Zero battery drain. No OOM crashes.
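
If you want to verify the RAM number on your own device, here is a minimal sketch of the kind of check we mean: reading VmRSS from /proc inside the process. This is standard Linux plumbing, nothing exotic; treat it as an illustration, not necessarily how we instrumented the demo.

```python
# Minimal sketch: read this process's resident memory (VmRSS) on
# Android/Linux via standard /proc plumbing. Illustration only, not
# necessarily how we instrumented the demo.
def resident_mb() -> float:
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # value is in kB
    raise RuntimeError("VmRSS not found in /proc/self/status")

print(f"Resident memory: {resident_mb():.0f} MB")
```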

The Elephant in the Room: 0.17 tokens/s

Yes, it is slow as hell right now. If it were running at 50 tok/s on a budget CPU in FP32, you guys would immediately (and rightfully) call BS and accuse me of hiding a 4-bit quantized lookup table or using an API.
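
Do the back-of-envelope arithmetic with just the numbers in this post and you can see how far below any hardware limit we are:

```python
# Back-of-envelope check using only the numbers from this post.
weights_gb = 2.3      # full FP32 weight set (from the specs above)
tokens_per_s = 0.17   # measured decode speed on the A25

seconds_per_token = 1 / tokens_per_s            # ~5.9 s per token
effective_gb_per_s = weights_gb * tokens_per_s  # ~0.39 GB/s if every
                                                # token streams all weights

print(f"{seconds_per_token:.1f} s/token, ~{effective_gb_per_s:.2f} GB/s effective")
# Budget-phone LPDDR4X can stream on the order of 10+ GB/s, so the
# bottleneck here is the interpreter, not the memory bus.
```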

This speed is because it’s a heavily unoptimized Python loop forcing raw 32-bit math sequentially through a budget mobile CPU. We deliberately handicapped it to prove a point about physics: The memory wall is a routing problem, not a compression problem. If a budget Exynos chip can physically route a 2.3GB FP32 graph without OS memory-killing it or melting the battery, the architecture works. Writing a C++ kernel and dropping to FP16 will make it fly later.
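
To make "heavily unoptimized Python loop" concrete, here is a minimal sketch of what a CPU-only ONNX decode loop like this looks like. The file name, input/output names, tokenizer handling, and greedy sampling are illustrative assumptions, not our exact harness:

```python
# Minimal sketch of a CPU-only greedy decode loop over an ONNX export.
# "model.onnx" and the input/output names are illustrative assumptions;
# the point is the shape of the loop, not our exact harness.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CPUExecutionProvider"],  # GPU off, NPU off: CPU only
)

def generate(input_ids: list[int], max_new_tokens: int = 32) -> list[int]:
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        # Re-run the whole sequence every step: deliberately unoptimized,
        # no KV cache, raw FP32 math on every token.
        logits = sess.run(
            None, {"input_ids": np.array([ids], dtype=np.int64)}
        )[0]
        next_id = int(np.argmax(logits[0, -1]))  # greedy pick
        ids.append(next_id)
    return ids
```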

How it fits without OOMing: We didn’t compress the weights; we fixed the network topology. We’re using what we call a “Synthetic Neural Engine” architecture. Instead of a vanilla dense transformer (this is still a transformer, just built our way), where you’re wasting compute on 90% static noise, we proceduralize the weights. We store a semantic dictionary of primitives plus a per-context recipe that dynamically reconstructs the exact full weight matrix W. We calculate exact attention but store only a compact state. Basically, we only compute the pure signal.
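
We haven’t published the kernel yet, but the general shape is easy to sketch. Assume each layer stores a small bank of primitive matrices plus a per-context vector of coefficients, and the full W is rebuilt on the fly. Everything below (the names, the linear-combination recipe) is a simplified illustration for intuition, not the released implementation:

```python
# Simplified illustration of the "dictionary of primitives + recipe"
# idea. The linear-combination recipe and all names here are a sketch
# for intuition, not the released Synthetic Neural Engine kernel.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_primitives = 64, 64, 8

# Shared semantic dictionary: a small bank of primitive matrices.
primitives = rng.standard_normal((n_primitives, d_out, d_in)).astype(np.float32)

# Per-context recipe: a handful of coefficients instead of a full matrix.
recipe = rng.standard_normal(n_primitives).astype(np.float32)

def reconstruct_w(recipe: np.ndarray) -> np.ndarray:
    """Rebuild the exact full weight matrix W from the compact recipe."""
    # tensordot sums recipe[k] * primitives[k] over the bank dimension.
    return np.tensordot(recipe, primitives, axes=1)

W = reconstruct_w(recipe)  # full d_out x d_in FP32 matrix
x = rng.standard_normal(d_in).astype(np.float32)
y = W @ x                  # exact dense math, compact storage
print(W.shape, y.shape)
```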

The Benchmarks: Even though it was trained on only 20B tokens (DCLM subset) for less than €1,000, this 245M model is already hitting 66% on PIQA and matching Meta’s 350M model on logic benchmarks.
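
If you want to re-run the PIQA number yourself once the weights are up, EleutherAI’s lm-evaluation-harness is the standard route. The repo id below is a placeholder until I post the real links in the comments:

```python
# Sketch of reproducing the PIQA score with lm-evaluation-harness.
# The repo id is a placeholder; real links will be in the comments.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=open-machine/placeholder-245m,dtype=float32",
    tasks=["piqa"],
    batch_size=1,
)
print(results["results"]["piqa"])  # accuracy should land around 0.66
```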

We built it anyway over the weekend and dropped this APK in their inbox today.

Stop letting Big Tech convince you that you need $7 Trillion and a massive server farm to solve edge logic.

Roast our Python loop, ask me about the math, and let me know what you think. I’ll drop the Hugging Face benchmark links in the comments so you can download it and test it yourself.
