OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)

Hi everyone,

I wanted to share a project that just released on PyPI: OMGFormer, an open-source PyTorch framework for building and training parallel masked diffusion language models.


What is it?

OMGFormer implements the same class of architecture behind Inception Labs’ Mercury — the first commercial-scale diffusion LLM ($50M funded, 1100+ tokens/sec on H100). The key difference: OMGFormer is fully open-source, Apache 2.0, and lets you train your own model from scratch.

Instead of generating tokens one at a time (autoregressive), it generates all tokens in parallel via iterative unmasking:

Step 0: "Hello [MASK] [MASK] [MASK] [MASK]"
Step 1: "Hello world  [MASK] [MASK] [MASK]"
Step 2: "Hello world  how   are  [MASK]"
Step 3: "Hello world  how   are   you?"

Generating 256 tokens takes 6–10 forward passes instead of 256. With Self-Conditioning, quality stays comparable even at fewer steps.
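
To make the loop concrete, here is a minimal sketch of confidence-based iterative unmasking in plain PyTorch. This is my own illustration, not OMGFormer's actual decoder; the model call and the unmasking rule are assumptions:

import torch

def parallel_unmask(model, prompt_ids, gen_len, mask_id, steps=10):
    # Start with the prompt followed by gen_len [MASK] tokens.
    ids = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, dtype=torch.long)]).unsqueeze(0)
    for step in range(steps):
        logits = model(ids)                 # assumed: (1, seq, vocab) logits
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)          # per-token confidence and argmax prediction
        masked = ids == mask_id
        if not masked.any():
            break
        # Unmask the most confident masked positions this step.
        k = max(1, int(masked.sum()) // (steps - step))
        conf = conf.masked_fill(~masked, -1.0)
        top = conf.topk(k, dim=-1).indices
        ids[0, top[0]] = pred[0, top[0]]
    return ids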


What shipped (v2.0.5)

The project is very new (~3 days old, one developer) and has no benchmarks yet due to limited compute resources. But the codebase is surprisingly complete:

Core architecture (60 features):

  • GQA, MLA (DeepSeek-style), Sliding Window, Linear Attention

  • AdaLN-Zero timestep conditioning (DiT-style; see the sketch after this list)

  • Self-Conditioning, Absorbing Diffusion, Remasking

  • MoE: top-K, Expert Choice routing (Google), Soft MoE (Google, 2023), Shared Expert (DeepSeek)

  • LoRA variants: standard, DoRA, QLoRA, rsLoRA, LoRA+

  • Advanced: KV Cache, MTP head, Model Merging (SLERP/DARE/TIES), PPO/Reward head, GGUF export stub, RAG injector, Dynamic batching
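
For readers unfamiliar with AdaLN-Zero (referenced in the list above): each block regresses shift, scale, and gate parameters from the timestep embedding, with the gate initialized to zero so every block starts as an identity. A generic DiT-style sketch, not OMGFormer's implementation:

import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Transformer sub-block with DiT-style AdaLN-Zero conditioning (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress shift, scale, and gate from the timestep embedding.
        self.mod = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.mod.weight)   # "Zero": gate starts at 0,
        nn.init.zeros_(self.mod.bias)     # so the block is initially an identity.

    def forward(self, x, t_emb):
        # x: (batch, seq, dim), t_emb: (batch, dim)
        shift, scale, gate = self.mod(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.ff(h)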

omg_data — Automated data pipeline:

from omg_data import DataPipeline  # assumed import path

pipe = DataPipeline(language="tr", task="chat", size_gb=5, tokenizer="gpt2")
dataset = pipe.build()  # finds → downloads → cleans → tokenizes automatically

Supports 15+ languages, 6 task types, and a full cleaning pipeline (deduplication, HTML/URL stripping, unicode normalization, language filtering).
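
For a rough idea of what those cleaning stages typically involve (a generic sketch, not omg_data's actual code; language filtering is omitted since it needs a language-ID model):

import hashlib
import re
import unicodedata

def clean_documents(docs):
    """Toy version of a dedup / HTML / URL / unicode cleaning pass."""
    seen = set()
    for text in docs:
        text = re.sub(r"<[^>]+>", " ", text)               # strip HTML tags
        text = re.sub(r"https?://\S+", " ", text)           # strip URLs
        text = unicodedata.normalize("NFKC", text)          # normalize unicode
        text = re.sub(r"\s+", " ", text).strip()
        digest = hashlib.md5(text.encode()).hexdigest()     # exact-duplicate filter
        if text and digest not in seen:
            seen.add(digest)
            yield text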

omg_hybridomga — Unified training engine:

  • All 6 LoRA methods in one package

  • Novel OMGa (OMG Adaptive LoRA): per-token learned gate with dual-rank adapters (see the sketch after this list)

  • VRAM guard, OOM recovery, MorphicMemory (Markov allocation prediction + tensor reuse)

  • SpectraOptimizer (FFT-domain adaptive AdamW), ResonanceScheduler (gradient-spectrum self-tuning LR)

  • GradientHarmonics (wavelet noise injection), NeuralProfiler (tiny LSTM predicts OOM/explode risk)

  • Unstoppable trainer — retries from checkpoint on any failure
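
Going only by that one-line description of OMGa, here is how I picture the idea (purely my guess, not the actual OMGa code): two LoRA branches of different rank, blended per token by a learned sigmoid gate.

import torch
import torch.nn as nn

class DualRankAdapter(nn.Module):
    """Guess at a 'per-token gate + dual-rank adapter' layout (illustrative only)."""
    def __init__(self, dim, r_lo=4, r_hi=32):
        super().__init__()
        self.lo = nn.Sequential(nn.Linear(dim, r_lo, bias=False), nn.Linear(r_lo, dim, bias=False))
        self.hi = nn.Sequential(nn.Linear(dim, r_hi, bias=False), nn.Linear(r_hi, dim, bias=False))
        self.gate = nn.Linear(dim, 1)   # one gate value per token

    def forward(self, x, base_out):
        g = torch.sigmoid(self.gate(x))                # (batch, seq, 1)
        delta = g * self.hi(x) + (1 - g) * self.lo(x)  # per-token blend of the two ranks
        return base_out + delta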


Current state

The architecture is solid and the code is well-written, but:

  • No training benchmarks yet (developer has limited GPU access)

  • Some stubs not fully implemented (Flash Attention 2 flag exists but falls back to SDPA)

  • MoE not yet fully integrated into OMGConfig (listed for next release)

  • No pretrained weights — you train from scratch

The developer is actively working on it and releases are moving fast.


Installation

pip install omgformer           # core
pip install omg_data            # data pipeline
pip install omg_hybridomga      # training engine

Quick start:

from omgformer import OMGConfig, OMGModel, MaskScheduler, ParallelDecoder

cfg     = OMGConfig.from_preset("omgformer-small")  # ~87M params
model   = OMGModel(cfg)
sched   = MaskScheduler(steps=10, mask_token_id=cfg.mask_token_id, vocab_size=cfg.vocab_size)
decoder = ParallelDecoder(model, sched)
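
Since there are no pretrained weights, you start from the masked-diffusion training objective. For orientation, a generic absorbing-diffusion training step looks roughly like this; the model signature (token ids plus timestep in, logits out) is my assumption, not OMGFormer's documented API:

import torch
import torch.nn.functional as F

def diffusion_train_step(model, batch_ids, mask_id):
    """One generic absorbing-diffusion step: mask a random fraction, predict the originals."""
    b, s = batch_ids.shape
    t = torch.rand(b, device=batch_ids.device)                       # per-sample mask ratio
    mask = torch.rand(b, s, device=batch_ids.device) < t.unsqueeze(1)
    noisy = batch_ids.masked_fill(mask, mask_id)
    logits = model(noisy, t)                                         # assumed signature
    return F.cross_entropy(logits[mask], batch_ids[mask])            # loss only on masked positions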


Looking for feedback

Since there are no benchmark results yet, the community’s help would be very valuable. If anyone has spare compute and wants to run experiments — even small ones on omgformer-tiny or omgformer-small — and share results here, that would help validate (or challenge) the approach.

Specific things worth testing:

  • Does loss converge normally on small datasets?

  • How does generation quality compare to a similarly-sized autoregressive model at the same step budget?

  • Any bugs in the data pipeline for non-English languages?

Happy to discuss the architecture or the diffusion LM approach in general.
