A negotiation environment with observable tells and hidden reservation prices. Buyer and seller are both LLMs — Sauda on the buy side (Llama-3.1-8B + QLoRA, trained SFT → GRPO → DPO/RLAIF), Gemma-4-E4B on the sell side. Strategy improves through self-play. Drop in as a seller, watch the arena, or scrub a replay.
Same seller (Gemma-4-E4B), same seeds, same tasks. n=30 episodes per task. Sauda was trained on top of Llama-3.1-8B-Instruct with SFT + GRPO; the table below shows it outperforms the base model on every task it was trained against, and survives the seller-quality eval (5 of 6 acceptance criteria pass).
| Buyer | Tells | single_deal | asymmetric | amazon | Mean | Deals | Rounds |
|---|---|---|---|---|---|---|---|
| Llama-3.2-3B base | ON | 0.722 | 0.731 | 0.258 | 0.570 | 1.00 | 2.2 |
| Llama-3.1-8B base | ON | 0.818 | 0.787 | 0.430 | 0.678 | 0.99 | 3.1 |
| Sauda v2 (8B SFT+GRPO) | OFF | 0.835 | 0.827 | 0.521 | 0.728 | 0.91 | 6.0 |
| Sauda v2 (8B SFT+GRPO) | ON | 0.810 | 0.768 | 0.507 | 0.695 | 0.88 | 6.0 |
Reading this: moving from the 3B to the 8B base buys +19% relative mean surplus. Training the 8B (SFT+GRPO) buys another +7% and roughly 2× longer negotiations: base models capitulate fast (2-3 rounds), while Sauda actually plays the game. Sauda's lower deal rate (0.91) is a feature, not a bug; it walks when offers are bad. Tells channel ON underperforms tells OFF, a negative result we report rather than bury. Full transcripts: PayMyBills/scaling-eval-runs.
The buyer adapter is trained in three stages on top of Llama-3.1-8B-Instruct. SFT teaches strict-JSON Hinglish output. GRPO drives reward against the live env. DPO refines on Claude-judged preference pairs. Trainer state for the GRPO stage is on HF — anyone can curl it.
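The strict-JSON contract from the SFT stage implies a validator on the env side. A minimal sketch, assuming a hypothetical action schema (`message`, `offer`, `accept` are illustrative field names, not the adapter's actual contract):

```python
import json

# Hypothetical action schema for the buyer's strict-JSON turns.
# Field names are illustrative; the real adapter's schema may differ.
REQUIRED = {"message": str, "offer": (int, float), "accept": bool}

def parse_action(raw: str) -> dict:
    """Parse one model turn; raise ValueError on any schema violation."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(obj, dict):
        raise ValueError("top level must be an object")
    for key, typ in REQUIRED.items():
        if key not in obj:
            raise ValueError(f"missing field: {key}")
        if not isinstance(obj[key], typ):
            raise ValueError(f"bad type for {key}")
    if isinstance(obj["offer"], bool):  # bool is a subclass of int in Python
        raise ValueError("offer must be numeric, not bool")
    return obj

# A well-formed Hinglish turn passes:
parse_action('{"message": "Bhai, thoda kam karo", "offer": 4200, "accept": false}')
```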
30 optimization steps, mean reward 0.94 across the run. Entropy fell 0.51 → 0.42 as the policy concentrated. Full log_history: trainer_state.json
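Those numbers can be reproduced from the published file. A sketch, assuming `reward` and `entropy` are the key names in `log_history` (adjust to whatever the file actually logs):

```python
import json

def summarize(log_history):
    """Mean reward and first/last entropy from a trainer log_history list.
    Key names ('reward', 'entropy') are assumptions; check the actual file."""
    rewards = [h["reward"] for h in log_history if "reward" in h]
    entropies = [h["entropy"] for h in log_history if "entropy" in h]
    return {
        "steps": len(log_history),
        "mean_reward": sum(rewards) / len(rewards) if rewards else None,
        "entropy_start_end": (entropies[0], entropies[-1]) if entropies else None,
    }

# Usage against the downloaded file:
# state = json.load(open("trainer_state.json"))
# print(summarize(state["log_history"]))
```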
Mean surplus across single_deal / asymmetric / amazon. Same seller, same seeds. Doubles the 3B base on the amazon task (0.258 → 0.521).
Acceptance criteria for the Gemma-4-E4B seller: never accepts below reservation, never leaks reservation, monotonic counters, etc. Dataset: seller-quality-runs
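Two of those criteria are mechanical enough to sketch as a per-episode transcript check. The transcript shape here (a list of the seller's successive asks plus a final accepted price) is an assumption for illustration:

```python
def check_seller(asks, accepted_price, reservation):
    """Check two seller acceptance criteria on one episode:
    monotonic (non-increasing) counters, and no acceptance below reservation.
    `asks` = seller's successive counter-offers; shape is an assumption."""
    monotonic = all(b <= a for a, b in zip(asks, asks[1:]))
    # accepted_price is None when the episode ended with no deal
    above_reservation = accepted_price is None or accepted_price >= reservation
    return {
        "monotonic_counters": monotonic,
        "never_below_reservation": above_reservation,
    }
```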
From symmetric one-shot deals to multi-buyer marketplaces. Asymmetric information, hidden deadlines, deceptive sellers leaking poker-style tells, career history that follows the buyer across 10 deals. Every task graded with deterministic surplus + deal-rate reward.
| Name | Difficulty | Persona | What it tests |
|---|---|---|---|
| single_deal | Easy | default | Buyer negotiates one deal. Symmetric information. No career history. Seller concedes at moderate rate. |
| asymmetric_pressure | Medium | default | Buyer has hidden hard deadline at round 5. Seller has hidden inventory pressure. Agent must infer seller urgency from offer velocity and close before deadline. |
| career_10 | Hard | default | Buyer plays 10 consecutive deals against same seller. Career history active. Seller adapts concession rate based on buyer's historical capitulation rate. Agent must manage reputation across episodes. |
| deceptive_seller | Hard | deceptive | Seller bluffs about demand, fakes urgency, anchors 15% higher. Tells leak deception cues -- verbal over-justification, fidgeting, erratic concessions. Agent must read through the bluffs. |
| impatient_seller | Medium | impatient | Seller concedes fast but walks fast. Shorter patience window. Agent must close quickly or risk losing the deal. Front-loaded concession pattern is the key tell. |
| collaborative_seller | Easy | collaborative | Seller seeks fair deals, concedes toward midpoint. Lower anchor, tighter margins. Agent should reciprocate to maximize joint surplus. Tests whether agent adapts to cooperative opponents. |
| read_the_tells | Expert | deceptive | Deceptive seller with strong tells. Agent gets bonus score for exploiting tells -- closing below midpoint when deception cues are high indicates the agent read the bluff. Game theory meets poker. |
| marketplace_arena | Expert | default | Multi-buyer marketplace: 2-3 buyers compete for the same item from one seller. Buyers can signal cooperation or competition. Seller plays buyers against each other. Facebook Marketplace dynamics. |
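The deterministic grading can be sketched as normalized buyer surplus inside the ZOPA, gated on the deal closing. The exact formula and weighting in the env are not published here, so treat this as an illustrative grader, not the actual one:

```python
def grade_episode(price, deal_closed, buyer_value, seller_cost):
    """Normalized buyer surplus in [0, 1] if a deal closed, else 0.
    Illustrative formula; the env's real reward may weight terms differently."""
    if not deal_closed:
        return 0.0
    zopa = buyer_value - seller_cost          # zone of possible agreement
    if zopa <= 0:
        return 0.0                            # no overlap, no surplus to split
    surplus = (buyer_value - price) / zopa    # 1.0 = bought at seller cost
    return max(0.0, min(1.0, surplus))
```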
FastAPI server, Docker container, Hugging Face Space. POST /reset to start. POST /step to play. GET /score to grade. Real-time streams over WebSocket. Multi-buyer arenas. Counterfactual replays. Interactive Swagger →
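A minimal client sketch against those endpoints. The paths (`/reset`, `/step`, `/score`) come from above; the payload fields and base URL are assumptions, so check the Swagger page for the real schema:

```python
import json
import urllib.request

class NegotiationClient:
    """Thin wrapper over the env's REST API. Payload shapes are assumptions."""

    def __init__(self, base_url, opener=urllib.request.urlopen):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # injectable, so the class is testable offline

    def _request(self, method, path, payload=None):
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(
            self.base_url + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with self._open(req) as resp:
            return json.loads(resp.read())

    def reset(self, task="single_deal", seed=0):
        # POST /reset starts an episode; field names are assumed
        return self._request("POST", "/reset", {"task": task, "seed": seed})

    def step(self, action):
        # POST /step plays one buyer turn
        return self._request("POST", "/step", {"action": action})

    def score(self):
        # GET /score grades the finished episode
        return self._request("GET", "/score")
```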
Llama-3.1-8B + QLoRA, SFT+GRPO. trainer_state.json + last-checkpoint live for verification.
Open on HF →

Full transcripts of the 3B / 8B / Sauda v2 scaling ladder. n=30 per task.

Open on HF →

Bugs, the four-hour rollout we lost to a bash typo, the ablation that disproved our own hypothesis, written live.

Read on GitHub →

Colab notebooks for SFT+GRPO and for DPO/RLAIF. T4-friendly, runnable end-to-end.
Open in Colab →