OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)
Hi everyone,
I wanted to share a project that just released on PyPI: OMGFormer, an open-source PyTorch framework for building and training parallel masked diffusion language models.
What is it?
OMGFormer implements the same class of architecture behind Inception Labs’ Mercury — the first commercial-scale diffusion LLM ($50M funded, 1100+ tokens/sec on H100). The key difference: OMGFormer is fully open-source, Apache 2.0, and lets you train your own model from scratch.
Instead of generating tokens one at a time (autoregressive), it generates all tokens in parallel via iterative unmasking:
Step 0: "Hello [MASK] [MASK] [MASK] [MASK]"
Step 1: "Hello world [MASK] [MASK] [MASK]"
Step 2: "Hello world how are [MASK]"
Step 3: "Hello world how are you?"
256 tokens → 6–10 forward passes instead of 256. With Self-Conditioning, quality stays comparable at even fewer steps.
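To make that concrete, here is a minimal sketch of a confidence-based unmasking loop in PyTorch. It is illustrative only, not OMGFormer's actual API: model stands in for any network mapping token ids to per-position logits (the real model also conditions on the diffusion timestep).
import torch

@torch.no_grad()
def parallel_unmask(model, prompt_ids, gen_len, mask_id, steps=10):
    # prompt_ids: 1-D LongTensor; gen_len positions start as [MASK].
    ids = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, dtype=torch.long)]).unsqueeze(0)
    for step in range(steps):
        masked = ids[0] == mask_id
        if not masked.any():
            break
        logits = model(ids)                      # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1)[0].max(-1)
        conf[~masked] = -1.0                     # never re-select already revealed positions
        k = max(1, masked.sum().item() // (steps - step))  # reveal a fraction of remaining masks
        reveal = conf.topk(k).indices
        ids[0, reveal] = pred[reveal]
    return ids[0]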
What shipped (v2.0.5)
The project is very new (~3 days old, one developer) and has no benchmarks yet due to limited compute resources. But the codebase is surprisingly complete:
Core architecture (60 features):
- GQA, MLA (DeepSeek-style), Sliding Window, Linear Attention
- AdaLN-Zero timestep conditioning (DiT-style; see the sketch after this list)
- Self-Conditioning, Absorbing Diffusion, Remasking
- MoE: top-K, Expert Choice (Google), Soft MoE (Google Brain 2023), Shared Expert (DeepSeek)
- LoRA variants: standard, DoRA, QLoRA, rsLoRA, LoRA+
- Advanced: KV Cache, MTP head, Model Merging (SLERP/DARE/TIES), PPO/Reward head, GGUF export stub, RAG injector, Dynamic batching
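The AdaLN-Zero item is the most diffusion-specific piece, so here is a rough sketch of the DiT-style idea for readers who haven't seen it. This is a generic paraphrase, not OMGFormer's code: the timestep embedding regresses a per-block shift, scale, and gate, and the projection is zero-initialized so every block starts out as an identity mapping.
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.adaln = nn.Linear(dim, 3 * dim)   # regress shift, scale, gate from the timestep embedding
        nn.init.zeros_(self.adaln.weight)      # the "-Zero" part: block starts as an identity map
        nn.init.zeros_(self.adaln.bias)

    def forward(self, x, t_emb):               # x: (B, T, dim), t_emb: (B, dim)
        shift, scale, gate = self.adaln(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.ff(h)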
omg_data — Automated data pipeline:
from omg_data import DataPipeline  # assuming the pipeline is exported at the package root
pipe = DataPipeline(language="tr", task="chat", size_gb=5, tokenizer="gpt2")
dataset = pipe.build() # finds → downloads → cleans → tokenizes automatically
Supports 15+ languages, 6 task types, full cleaning pipeline (dedup, HTML, URL, unicode, lang filter).
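I haven't dug into the cleaning code, but the listed steps map onto standard text filtering. A hedged sketch of what the dedup plus HTML/URL stripping stages might amount to (the actual pipeline very likely does more, e.g. near-dedup and language ID):
import hashlib
import re

def clean_docs(docs):
    seen = set()
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc)        # drop HTML tags
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if text and key not in seen:               # exact dedup via content hash
            seen.add(key)
            yield text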
omg_hybridomga — Unified training engine:
- All 6 LoRA methods in one package
- Novel OMGa (OMG Adaptive LoRA): per-token learned gate with dual-rank adapters (sketched after this list)
- VRAM guard, OOM recovery, MorphicMemory (Markov allocation prediction + tensor reuse)
- SpectraOptimizer (FFT-domain adaptive AdamW), ResonanceScheduler (gradient-spectrum self-tuning LR)
- GradientHarmonics (wavelet noise injection), NeuralProfiler (tiny LSTM predicts OOM/explode risk)
- Unstoppable trainer: retries from checkpoint on any failure
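The OMGa bullet is the most novel-sounding claim, so to be clear about how I read it: a frozen base layer plus two LoRA branches of different ranks, mixed by a gate computed per token. The sketch below is purely my reading of that one-line description, not the omg_hybridomga implementation; every name in it is made up.
import torch
import torch.nn as nn

class GatedDualRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, r_low=4, r_high=16, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # base weights stay frozen, LoRA-style
        d_in, d_out = base.in_features, base.out_features
        self.a_low, self.b_low = nn.Linear(d_in, r_low, bias=False), nn.Linear(r_low, d_out, bias=False)
        self.a_high, self.b_high = nn.Linear(d_in, r_high, bias=False), nn.Linear(r_high, d_out, bias=False)
        nn.init.zeros_(self.b_low.weight)           # both adapters start as no-ops
        nn.init.zeros_(self.b_high.weight)
        self.gate = nn.Linear(d_in, 1)              # per-token gate deciding the low/high-rank mix
        self.scale = alpha / r_high

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))             # (..., 1), one value per token
        delta = g * self.b_low(self.a_low(x)) + (1 - g) * self.b_high(self.a_high(x))
        return self.base(x) + self.scale * delta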
Current state
The architecture is solid and the code is well-written, but:
- No training benchmarks yet (developer has limited GPU access)
- Some stubs not fully implemented (Flash Attention 2 flag exists but falls back to SDPA)
- MoE not yet fully integrated into OMGConfig (listed for next release)
- No pretrained weights; you train from scratch
The developer is actively working on it and releases are moving fast.
Installation
pip install omgformer # core
pip install omg_data # data pipeline
pip install omg_hybridomga # training engine
Quick start:
from omgformer import OMGConfig, OMGModel, MaskScheduler, ParallelDecoder
cfg = OMGConfig.from_preset("omgformer-small") # ~87M params
model = OMGModel(cfg)
sched = MaskScheduler(steps=10, mask_token_id=cfg.mask_token_id, vocab_size=cfg.vocab_size)
decoder = ParallelDecoder(model, sched)
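If you do run the convergence test mentioned in the feedback section below, note that absorbing-state masked diffusion is usually trained with cross-entropy on the masked positions only. A generic sketch of that objective under the usual formulation, not the framework's actual training loop (the real model also takes the timestep/mask rate as input):
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    # tokens: (B, T) LongTensor of clean token ids.
    rate = torch.rand(tokens.size(0), 1, device=tokens.device)    # per-sequence masking rate
    mask = torch.rand(tokens.shape, device=tokens.device) < rate  # which positions get corrupted
    mask[:, 0] |= ~mask.any(dim=1)                                # guarantee at least one mask per row
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)                                     # (B, T, V)
    return F.cross_entropy(logits[mask], tokens[mask])            # loss only on masked positions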
- PyPI: omgformer (https://pypi.org/project/omgformer/)
Looking for feedback
Since there are no benchmark results yet, the community’s help would be very valuable. If anyone has spare compute and wants to run experiments — even small ones on omgformer-tiny or omgformer-small — and share results here, that would help validate (or challenge) the approach.
Specific things worth testing:
- Does loss converge normally on small datasets?
- How does generation quality compare to a similarly-sized autoregressive model at the same step budget?
- Any bugs in the data pipeline for non-English languages?
Happy to discuss the architecture or the diffusion LM approach in general.