OpenEnv · Negotiation Playground

Watch agents haggle.
Step in yourself.

A negotiation environment with observable tells and hidden reservation prices. Buyer and seller are both LLMs — Sauda on the buy side (Llama-3.1-8B + QLoRA, trained SFT → GRPO → DPO/RLAIF), Gemma-4-E4B on the sell side. Strategy improves through self-play. Drop in as a seller, watch the arena, or scrub a replay.

Powered by RLAIF · OpenEnv-compliant · 8B + QLoRA · 8 tasks · 4 personas
Headline result

Sauda v2 beats the 8B base by 7.4% mean surplus

Same seller (Gemma-4-E4B), same seeds, same tasks. n=30 episodes per task. Sauda was trained on top of Llama-3.1-8B-Instruct with SFT + GRPO; the table below shows it outperforms the base model on every task it was trained against, and survives the seller-quality eval (5 of 6 acceptance criteria pass).

| Buyer | Tells | single_deal | asymmetric | amazon | Mean surplus | Deal rate | Avg rounds |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B base | ON | 0.722 | 0.731 | 0.258 | 0.570 | 1.00 | 2.2 |
| Llama-3.1-8B base | ON | 0.818 | 0.787 | 0.430 | 0.678 | 0.99 | 3.1 |
| Sauda v2 (8B SFT+GRPO) | OFF | 0.835 | 0.827 | 0.521 | 0.728 | 0.91 | 6.0 |
| Sauda v2 (8B SFT+GRPO) | ON | 0.810 | 0.768 | 0.507 | 0.695 | 0.88 | 6.0 |

Reading this: moving from the 3B to the 8B base buys +19% mean surplus. Training the 8B with SFT+GRPO buys another +7% and roughly 2× longer negotiations: base models capitulate fast (2-3 rounds), while Sauda actually plays the game. Sauda's lower deal rate (0.91) is a feature, not a bug: it walks when offers are bad. The tells channel ON underperforms tells OFF, kept and reported as a negative result. Full transcripts: PayMyBills/scaling-eval-runs.
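Both headline percentages follow directly from the Mean surplus column (tells ON for the base models, tells OFF for Sauda v2), read as relative improvements:

```python
# Mean surplus from the table: 3B base (ON), 8B base (ON), Sauda v2 (OFF).
base_3b, base_8b, sauda_v2 = 0.570, 0.678, 0.728

print(f"3B base -> 8B base : {(base_8b / base_3b - 1) * 100:+.1f}%")   # +18.9%, the "+19%"
print(f"8B base -> Sauda v2: {(sauda_v2 / base_8b - 1) * 100:+.1f}%")  # +7.4%
```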

Training

SFT → GRPO → DPO/RLAIF

The buyer adapter is trained in three stages on top of Llama-3.1-8B-Instruct. SFT teaches strict-JSON Hinglish output. GRPO drives reward against the live env. DPO refines on Claude-judged preference pairs. Trainer state for the GRPO stage is on HF — anyone can curl it.
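As a rough sketch, the GRPO stage could look like the following with TRL's GRPOTrainer on top of the instruct base. The dataset, reward stub, and hyperparameters here are illustrative assumptions, not the actual training config (that lives in the notebooks): the real reward replays each completion against the live env.

```python
import json

from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def env_reward(completions, **kwargs):
    """Stand-in reward. The real stage scores completions by replaying them against
    the negotiation env; this placeholder just rewards strict-JSON output."""
    scores = []
    for completion in completions:
        try:
            json.loads(completion)
            scores.append(1.0)
        except (TypeError, ValueError):
            scores.append(0.0)
    return scores

# Illustrative prompt set; the real one is built from negotiation episode states.
train_dataset = Dataset.from_list(
    [{"prompt": "Seller ka opening offer ₹1200 hai. Apna counter JSON mein do."}] * 64
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="sauda-grpo", num_generations=4,
                    max_completion_length=256, logging_steps=1),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```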

GRPO reward
0.97 peak

30 optimization steps, mean reward 0.94 across the run. Entropy fell 0.51 → 0.42 as the policy concentrated. Full log_history: trainer_state.json
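To check those numbers yourself, the trainer state can be pulled off the Hub and summarized in a few lines. The filename, its location at the repo root, and the `reward` key inside `log_history` are assumptions about this repo and TRL version:

```python
import json
from huggingface_hub import hf_hub_download

# Assumes trainer_state.json sits at the root of the adapter repo.
path = hf_hub_download("PayMyBills/bestdealbot-v2", "trainer_state.json")
with open(path) as f:
    history = json.load(f)["log_history"]

# GRPO logs a per-step "reward"; exact key names can vary by trainer version.
rewards = [entry["reward"] for entry in history if "reward" in entry]
print(f"steps={len(rewards)}  peak={max(rewards):.2f}  mean={sum(rewards) / len(rewards):.2f}")
```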

Scaling-ladder win
+7.4% vs 8B base

Mean surplus across single_deal / asymmetric / amazon. Same seller, same seeds. Doubles the 3B base on the amazon task (0.258 → 0.521).

Seller quality
5 / 6 passing

Acceptance criteria for the Gemma-4-E4B seller: never accepts below reservation, never leaks reservation, monotonic counters, etc. Dataset: seller-quality-runs
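A minimal sketch of how a couple of those criteria could be checked over an episode transcript. The transcript fields and the reading of "monotonic counters" as non-increasing seller offers are assumptions:

```python
def check_seller_quality(seller_offers, reservation, accepted_price=None):
    """Spot-check two acceptance criteria for a single episode."""
    return {
        # Never accepts below reservation.
        "no_accept_below_reservation": accepted_price is None or accepted_price >= reservation,
        # Monotonic counters: the seller's asking price never moves back up.
        "monotonic_counters": all(nxt <= cur for cur, nxt in zip(seller_offers, seller_offers[1:])),
    }

print(check_seller_quality([1200, 1100, 1050, 1050], reservation=900, accepted_price=1050))
# {'no_accept_below_reservation': True, 'monotonic_counters': True}
```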

The environment

8 tasks. 4 seller personas. 1 OpenEnv API.

From symmetric one-shot deals to multi-buyer marketplaces. Asymmetric information, hidden deadlines, deceptive sellers leaking poker-style tells, career history that follows the buyer across 10 deals. Every task graded with deterministic surplus + deal-rate reward.

| Name | Difficulty | Persona | What it tests |
| --- | --- | --- | --- |
| single_deal | Easy | default | Buyer negotiates one deal. Symmetric information. No career history. Seller concedes at moderate rate. |
| asymmetric_pressure | Medium | default | Buyer has hidden hard deadline at round 5. Seller has hidden inventory pressure. Agent must infer seller urgency from offer velocity and close before deadline. |
| career_10 | Hard | default | Buyer plays 10 consecutive deals against same seller. Career history active. Seller adapts concession rate based on buyer's historical capitulation rate. Agent must manage reputation across episodes. |
| deceptive_seller | Hard | deceptive | Seller bluffs about demand, fakes urgency, anchors 15% higher. Tells leak deception cues: verbal over-justification, fidgeting, erratic concessions. Agent must read through the bluffs. |
| impatient_seller | Medium | impatient | Seller concedes fast but walks fast. Shorter patience window. Agent must close quickly or risk losing the deal. Front-loaded concession pattern is the key tell. |
| collaborative_seller | Easy | collaborative | Seller seeks fair deals, concedes toward midpoint. Lower anchor, tighter margins. Agent should reciprocate to maximize joint surplus. Tests whether agent adapts to cooperative opponents. |
| read_the_tells | Expert | deceptive | Deceptive seller with strong tells. Agent gets bonus score for exploiting tells: closing below midpoint when deception cues are high indicates the agent read the bluff. Game theory meets poker. |
| marketplace_arena | Expert | default | Multi-buyer marketplace: 2-3 buyers compete for the same item from one seller. Buyers can signal cooperation or competition. Seller plays buyers against each other. Facebook Marketplace dynamics. |
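For concreteness, here is a sketch of one plausible form of that deterministic grading rule on the buyer side; the exact normalization the env uses is an assumption.

```python
def grade_episode(deal_price, buyer_budget, seller_cost):
    """Deterministic buyer-side grade: normalized surplus plus a deal indicator."""
    if deal_price is None:          # buyer walked away: no surplus, deal rate takes the hit
        return {"surplus": 0.0, "deal": 0}
    zopa = buyer_budget - seller_cost               # zone of possible agreement
    surplus = (buyer_budget - deal_price) / zopa    # 1.0 = bought at cost, 0.0 = paid full budget
    return {"surplus": max(0.0, min(1.0, surplus)), "deal": 1}

print(grade_episode(deal_price=950, buyer_budget=1100, seller_cost=800))
# {'surplus': 0.5, 'deal': 1}
```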
OpenEnv API

The endpoints judges run against

FastAPI server, Docker container, Hugging Face Space. POST /reset to start. POST /step to play. GET /score to grade. Real-time streams over WebSocket. Multi-buyer arenas. Counterfactual replays. Interactive Swagger →

| Method | Path | Description |
| --- | --- | --- |
| POST | /reset | Start an episode |
| POST | /step | Submit buyer action |
| GET | /state | Full env state |
| GET | /score | Graded score |
| GET | /tasks | List tasks |
| WS | /ws/{session} | Real-time stream |
| GET | /leaderboard | Score board |
| POST | /leaderboard/record | Record a score |
| POST | /counterfactual | What-if replay |
| POST | /arena/create | Multi-buyer arena |
| POST | /arena/join | Join arena |
| POST | /arena/step | Arena step |
| GET | /arena/state | Arena state |
| POST | /highlight | Extract seller tells |
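A minimal client against those endpoints, assuming the Space URL is known. The request and response field names here are guesses, so treat the Swagger docs as the source of truth:

```python
import requests

BASE = "https://<your-space>.hf.space"  # HF Space or local Docker container

# Start an episode, submit one buyer counter-offer, then read the grade.
episode = requests.post(f"{BASE}/reset", json={"task": "single_deal"}).json()
session_id = episode["session_id"]  # field name assumed

step = requests.post(
    f"{BASE}/step",
    json={"session_id": session_id, "action": {"type": "counter", "price": 950}},
).json()

score = requests.get(f"{BASE}/score", params={"session_id": session_id}).json()
print(step, score)
```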
Artifacts on Hugging Face

Everything is durable. Anyone can reproduce.

Adapter

PayMyBills/bestdealbot-v2

Llama-3.1-8B + QLoRA, trained with SFT+GRPO. trainer_state.json and the last checkpoint are live for verification.

Open on HF →
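Loading the adapter for local inference follows the standard PEFT pattern; quantization and generation settings are omitted here, and the adapter layout at the repo root is assumed.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "PayMyBills/bestdealbot-v2")  # attach the QLoRA adapter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
```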
Eval datasets

scaling-eval-runs

Full transcripts of the 3B / 8B / Sauda v2 scaling ladder. n=30 per task.

Open on HF →
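The transcripts load like any Hub dataset; the split name and column schema below are assumptions, so check the dataset card for the real layout.

```python
from datasets import load_dataset

# Split name is assumed; inspect the dataset card for the actual configuration.
runs = load_dataset("PayMyBills/scaling-eval-runs", split="train")
print(runs.column_names, len(runs))
```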
Hackathon journal

The blog with all receipts

Bugs, the four-hour rollout we lost to a bash typo, the ablation that disproved our own hypothesis: all written live.

Read on GitHub →
Training notebooks

One-click reproduce

Colab notebooks for SFT+GRPO and for DPO/RLAIF. T4-friendly, runnable end-to-end.

Open in Colab →