OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)

Hi everyone,

I wanted to share a project that just released on PyPI: OMGFormer, an open-source PyTorch framework for building and training parallel masked diffusion language models.


What is it?

OMGFormer implements the same class of architecture behind Inception Labs’ Mercury — the first commercial-scale diffusion LLM ($50M funded, 1100+ tokens/sec on H100). The key difference: OMGFormer is fully open-source, Apache 2.0, and lets you train your own model from scratch.

Instead of generating tokens one at a time (autoregressive), it generates all tokens in parallel via iterative unmasking:

Step 0: "Hello [MASK] [MASK] [MASK] [MASK]"
Step 1: "Hello world  [MASK] [MASK] [MASK]"
Step 2: "Hello world  how   are  [MASK]"
Step 3: "Hello world  how   are   you?"

Generating 256 tokens takes 6–10 forward passes instead of 256. With Self-Conditioning, quality stays comparable even at fewer steps.
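
To make the loop concrete, here is a minimal sketch of confidence-based iterative unmasking in plain PyTorch. This is my own illustration, not OMGFormer's actual decoder; the model call and the unmasking rule are assumptions:

import torch

def parallel_unmask(model, prompt_ids, gen_len, mask_id, steps=10):
    # Start with the prompt followed by gen_len [MASK] tokens.
    ids = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, dtype=torch.long)]).unsqueeze(0)
    for step in range(steps):
        logits = model(ids)                 # assumed: (1, seq, vocab) logits
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)          # per-token confidence and argmax prediction
        masked = ids == mask_id
        if not masked.any():
            break
        # Unmask the most confident masked positions this step.
        k = max(1, int(masked.sum()) // (steps - step))
        conf = conf.masked_fill(~masked, -1.0)
        top = conf.topk(k, dim=-1).indices
        ids[0, top[0]] = pred[0, top[0]]
    return ids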


What shipped (v2.0.5)

The project is very new (~3 days old, one developer) and has no benchmarks yet due to limited compute resources. But the codebase is surprisingly complete:

Core architecture (60 features):

  • GQA, MLA (DeepSeek-style), Sliding Window, Linear Attention

  • AdaLN-Zero timestep conditioning (DiT-style; see the sketch after this list)

  • Self-Conditioning, Absorbing Diffusion, Remasking

  • MoE: top-K, Expert Choice routing (Google), Soft MoE (Google, 2023), Shared Expert (DeepSeek)

  • LoRA variants: standard, DoRA, QLoRA, rsLoRA, LoRA+

  • Advanced: KV Cache, MTP head, Model Merging (SLERP/DARE/TIES), PPO/Reward head, GGUF export stub, RAG injector, Dynamic batching
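
For readers unfamiliar with AdaLN-Zero (referenced in the list above): each block regresses shift, scale, and gate parameters from the timestep embedding, with the gate initialized to zero so every block starts as an identity. A generic DiT-style sketch, not OMGFormer's implementation:

import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Transformer sub-block with DiT-style AdaLN-Zero conditioning (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress shift, scale, and gate from the timestep embedding.
        self.mod = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.mod.weight)   # "Zero": gate starts at 0,
        nn.init.zeros_(self.mod.bias)     # so the block is initially an identity.

    def forward(self, x, t_emb):
        # x: (batch, seq, dim), t_emb: (batch, dim)
        shift, scale, gate = self.mod(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.ff(h)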

omg_data — Automated data pipeline:

from omg_data import DataPipeline  # assumed import path

pipe = DataPipeline(language="tr", task="chat", size_gb=5, tokenizer="gpt2")
dataset = pipe.build()  # finds → downloads → cleans → tokenizes automatically

Supports 15+ languages, 6 task types, and a full cleaning pipeline (deduplication, HTML/URL stripping, unicode normalization, language filtering).
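
For a rough idea of what those cleaning stages typically involve (a generic sketch, not omg_data's actual code; language filtering is omitted since it needs a language-ID model):

import hashlib
import re
import unicodedata

def clean_documents(docs):
    """Toy version of a dedup / HTML / URL / unicode cleaning pass."""
    seen = set()
    for text in docs:
        text = re.sub(r"<[^>]+>", " ", text)               # strip HTML tags
        text = re.sub(r"https?://\S+", " ", text)           # strip URLs
        text = unicodedata.normalize("NFKC", text)          # normalize unicode
        text = re.sub(r"\s+", " ", text).strip()
        digest = hashlib.md5(text.encode()).hexdigest()     # exact-duplicate filter
        if text and digest not in seen:
            seen.add(digest)
            yield text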

omg_hybridomga — Unified training engine:

  • All 6 LoRA methods in one package

  • Novel OMGa (OMG Adaptive LoRA): per-token learned gate with dual-rank adapters (see the sketch after this list)

  • VRAM guard, OOM recovery, MorphicMemory (Markov allocation prediction + tensor reuse)

  • SpectraOptimizer (FFT-domain adaptive AdamW), ResonanceScheduler (gradient-spectrum self-tuning LR)

  • GradientHarmonics (wavelet noise injection), NeuralProfiler (tiny LSTM predicts OOM/explode risk)

  • Unstoppable trainer — retries from checkpoint on any failure
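
Going only by that one-line description of OMGa, here is how I picture the idea (purely my guess, not the actual OMGa code): two LoRA branches of different rank, blended per token by a learned sigmoid gate.

import torch
import torch.nn as nn

class DualRankAdapter(nn.Module):
    """Guess at a 'per-token gate + dual-rank adapter' layout (illustrative only)."""
    def __init__(self, dim, r_lo=4, r_hi=32):
        super().__init__()
        self.lo = nn.Sequential(nn.Linear(dim, r_lo, bias=False), nn.Linear(r_lo, dim, bias=False))
        self.hi = nn.Sequential(nn.Linear(dim, r_hi, bias=False), nn.Linear(r_hi, dim, bias=False))
        self.gate = nn.Linear(dim, 1)   # one gate value per token

    def forward(self, x, base_out):
        g = torch.sigmoid(self.gate(x))                # (batch, seq, 1)
        delta = g * self.hi(x) + (1 - g) * self.lo(x)  # per-token blend of the two ranks
        return base_out + delta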


Current state

The architecture is solid and the code is well-written, but:

  • No training benchmarks yet (developer has limited GPU access)

  • Some stubs not fully implemented (Flash Attention 2 flag exists but falls back to SDPA)

  • MoE not yet fully integrated into OMGConfig (listed for next release)

  • No pretrained weights — you train from scratch

The developer is actively working on it and releases are moving fast.


Installation

pip install omgformer           # core
pip install omg_data            # data pipeline
pip install omg_hybridomga      # training engine

Quick start:

from omgformer import OMGConfig, OMGModel, MaskScheduler, ParallelDecoder

cfg     = OMGConfig.from_preset("omgformer-small")  # ~87M params
model   = OMGModel(cfg)
sched   = MaskScheduler(steps=10, mask_token_id=cfg.mask_token_id, vocab_size=cfg.vocab_size)
decoder = ParallelDecoder(model, sched)
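
Since there are no pretrained weights, you start from the masked-diffusion training objective. For orientation, a generic absorbing-diffusion training step looks roughly like this; the model signature (token ids plus timestep in, logits out) is my assumption, not OMGFormer's documented API:

import torch
import torch.nn.functional as F

def diffusion_train_step(model, batch_ids, mask_id):
    """One generic absorbing-diffusion step: mask a random fraction, predict the originals."""
    b, s = batch_ids.shape
    t = torch.rand(b, device=batch_ids.device)                       # per-sample mask ratio
    mask = torch.rand(b, s, device=batch_ids.device) < t.unsqueeze(1)
    noisy = batch_ids.masked_fill(mask, mask_id)
    logits = model(noisy, t)                                         # assumed signature
    return F.cross_entropy(logits[mask], batch_ids[mask])            # loss only on masked positions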


Looking for feedback

Since there are no benchmark results yet, the community’s help would be very valuable. If anyone has spare compute and wants to run experiments — even small ones on omgformer-tiny or omgformer-small — and share results here, that would help validate (or challenge) the approach.

Specific things worth testing:

  • Does loss converge normally on small datasets?

  • How does generation quality compare to a similarly-sized autoregressive model at the same step budget?

  • Any bugs in the data pipeline for non-English languages?

Happy to discuss the architecture or the diffusion LM approach in general.
