OpenEnv · Negotiation Playground

Watch agents haggle.
Step in yourself.

A negotiation environment with observable tells and hidden reservation prices. Buyer and seller are both LLMs — Sauda on the buy side (Llama-3.1-8B + QLoRA, trained SFT → GRPO → DPO/RLAIF), Gemma-4-E4B on the sell side. Strategy improves through self-play. Drop in as a seller, watch the arena, or scrub a replay.

Powered by RLAIF · OpenEnv-compliant · 8B + QLoRA · 8 tasks · 4 personas
Headline result

Sauda v2 beats the 8B base by 7.4% mean surplus

Same seller (Gemma-4-E4B), same seeds, same tasks. n=30 episodes per task. Sauda was trained on top of Llama-3.1-8B-Instruct with SFT + GRPO; the table below shows it outperforms the base model on every task it was trained against, and survives the seller-quality eval (5 of 6 acceptance criteria pass).

| Buyer | Tells | single_deal | asymmetric | amazon | Mean surplus | Deal rate | Avg rounds |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B base | ON | 0.722 | 0.731 | 0.258 | 0.570 | 1.00 | 2.2 |
| Llama-3.1-8B base | ON | 0.818 | 0.787 | 0.430 | 0.678 | 0.99 | 3.1 |
| Sauda v2 (8B SFT+GRPO) | OFF | 0.835 | 0.827 | 0.521 | 0.728 | 0.91 | 6.0 |
| Sauda v2 (8B SFT+GRPO) | ON | 0.810 | 0.768 | 0.507 | 0.695 | 0.88 | 6.0 |

Reading this: moving from the 3B to the 8B base buys +19% mean surplus. Training the 8B with SFT+GRPO buys another +7% and roughly 2× longer negotiations: base models capitulate fast (2-3 rounds), while Sauda actually plays the game. Sauda's lower deal rate (0.91) is a feature, not a bug: it walks when offers are bad. The tells channel ON underperforms tells OFF, kept and reported as a negative result. Full transcripts: PayMyBills/scaling-eval-runs.
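Both headline percentages follow directly from the Mean surplus column (tells ON for the base models, tells OFF for Sauda v2), read as relative improvements:

```python
# Mean surplus from the table: 3B base (ON), 8B base (ON), Sauda v2 (OFF).
base_3b, base_8b, sauda_v2 = 0.570, 0.678, 0.728

print(f"3B base -> 8B base : {(base_8b / base_3b - 1) * 100:+.1f}%")   # +18.9%, the "+19%"
print(f"8B base -> Sauda v2: {(sauda_v2 / base_8b - 1) * 100:+.1f}%")  # +7.4%
```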

Training

SFT → GRPO → DPO/RLAIF

The buyer adapter is trained in three stages on top of Llama-3.1-8B-Instruct. SFT teaches strict-JSON Hinglish output. GRPO drives reward against the live env. DPO refines on Claude-judged preference pairs. Trainer state for the GRPO stage is on HF — anyone can curl it.
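As a rough sketch, the GRPO stage could look like the following with TRL's GRPOTrainer on top of the instruct base. The dataset, reward stub, and hyperparameters here are illustrative assumptions, not the actual training config (that lives in the notebooks): the real reward replays each completion against the live env.

```python
import json

from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def env_reward(completions, **kwargs):
    """Stand-in reward. The real stage scores completions by replaying them against
    the negotiation env; this placeholder just rewards strict-JSON output."""
    scores = []
    for completion in completions:
        try:
            json.loads(completion)
            scores.append(1.0)
        except (TypeError, ValueError):
            scores.append(0.0)
    return scores

# Illustrative prompt set; the real one is built from negotiation episode states.
train_dataset = Dataset.from_list(
    [{"prompt": "Seller ka opening offer ₹1200 hai. Apna counter JSON mein do."}] * 64
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="sauda-grpo", num_generations=4,
                    max_completion_length=256, logging_steps=1),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```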

GRPO reward
0.97 peak

30 optimization steps, mean reward 0.94 across the run. Entropy fell 0.51 → 0.42 as the policy concentrated. Full log_history: trainer_state.json
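To check those numbers yourself, the trainer state can be pulled off the Hub and summarized in a few lines. The filename, its location at the repo root, and the `reward` key inside `log_history` are assumptions about this repo and TRL version:

```python
import json
from huggingface_hub import hf_hub_download

# Assumes trainer_state.json sits at the root of the adapter repo.
path = hf_hub_download("PayMyBills/bestdealbot-v2", "trainer_state.json")
with open(path) as f:
    history = json.load(f)["log_history"]

# GRPO logs a per-step "reward"; exact key names can vary by trainer version.
rewards = [entry["reward"] for entry in history if "reward" in entry]
print(f"steps={len(rewards)}  peak={max(rewards):.2f}  mean={sum(rewards) / len(rewards):.2f}")
```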

Scaling-ladder win
+7.4% vs 8B base

Mean surplus across single_deal / asymmetric / amazon. Same seller, same seeds. Doubles the 3B base on the amazon task (0.258 → 0.521).

Seller quality
5 / 6 passing

Acceptance criteria for the Gemma-4-E4B seller: never accepts below reservation, never leaks reservation, monotonic counters, etc. Dataset: seller-quality-runs
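A minimal sketch of how a couple of those criteria could be checked over an episode transcript. The transcript fields and the reading of "monotonic counters" as non-increasing seller offers are assumptions:

```python
def check_seller_quality(seller_offers, reservation, accepted_price=None):
    """Spot-check two acceptance criteria for a single episode."""
    return {
        # Never accepts below reservation.
        "no_accept_below_reservation": accepted_price is None or accepted_price >= reservation,
        # Monotonic counters: the seller's asking price never moves back up.
        "monotonic_counters": all(nxt <= cur for cur, nxt in zip(seller_offers, seller_offers[1:])),
    }

print(check_seller_quality([1200, 1100, 1050, 1050], reservation=900, accepted_price=1050))
# {'no_accept_below_reservation': True, 'monotonic_counters': True}
```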

The environment

8 tasks. 4 seller personas. 1 OpenEnv API.

From symmetric one-shot deals to multi-buyer marketplaces. Asymmetric information, hidden deadlines, deceptive sellers leaking poker-style tells, career history that follows the buyer across 10 deals. Every task graded with deterministic surplus + deal-rate reward.

| Name | Difficulty | Persona | What it tests |
| --- | --- | --- | --- |
| single_deal | Easy | default | Buyer negotiates one deal. Symmetric information. No career history. Seller concedes at moderate rate. |
| asymmetric_pressure | Medium | default | Buyer has hidden hard deadline at round 5. Seller has hidden inventory pressure. Agent must infer seller urgency from offer velocity and close before deadline. |
| career_10 | Hard | default | Buyer plays 10 consecutive deals against same seller. Career history active. Seller adapts concession rate based on buyer's historical capitulation rate. Agent must manage reputation across episodes. |
| deceptive_seller | Hard | deceptive | Seller bluffs about demand, fakes urgency, anchors 15% higher. Tells leak deception cues: verbal over-justification, fidgeting, erratic concessions. Agent must read through the bluffs. |
| impatient_seller | Medium | impatient | Seller concedes fast but walks fast. Shorter patience window. Agent must close quickly or risk losing the deal. Front-loaded concession pattern is the key tell. |
| collaborative_seller | Easy | collaborative | Seller seeks fair deals, concedes toward midpoint. Lower anchor, tighter margins. Agent should reciprocate to maximize joint surplus. Tests whether agent adapts to cooperative opponents. |
| read_the_tells | Expert | deceptive | Deceptive seller with strong tells. Agent gets bonus score for exploiting tells: closing below midpoint when deception cues are high indicates the agent read the bluff. Game theory meets poker. |
| marketplace_arena | Expert | default | Multi-buyer marketplace: 2-3 buyers compete for the same item from one seller. Buyers can signal cooperation or competition. Seller plays buyers against each other. Facebook Marketplace dynamics. |
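For concreteness, here is a sketch of one plausible form of that deterministic grading rule on the buyer side; the exact normalization the env uses is an assumption.

```python
def grade_episode(deal_price, buyer_budget, seller_cost):
    """Deterministic buyer-side grade: normalized surplus plus a deal indicator."""
    if deal_price is None:          # buyer walked away: no surplus, deal rate takes the hit
        return {"surplus": 0.0, "deal": 0}
    zopa = buyer_budget - seller_cost               # zone of possible agreement
    surplus = (buyer_budget - deal_price) / zopa    # 1.0 = bought at cost, 0.0 = paid full budget
    return {"surplus": max(0.0, min(1.0, surplus)), "deal": 1}

print(grade_episode(deal_price=950, buyer_budget=1100, seller_cost=800))
# {'surplus': 0.5, 'deal': 1}
```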
OpenEnv API

The endpoints judges run against

FastAPI server, Docker container, Hugging Face Space. POST /reset to start. POST /step to play. GET /score to grade. Real-time streams over WebSocket. Multi-buyer arenas. Counterfactual replays. Interactive Swagger →

| Method | Path | Description |
| --- | --- | --- |
| POST | /reset | Start an episode |
| POST | /step | Submit buyer action |
| GET | /state | Full env state |
| GET | /score | Graded score |
| GET | /tasks | List tasks |
| WS | /ws/{session} | Real-time stream |
| GET | /leaderboard | Score board |
| POST | /leaderboard/record | Record a score |
| POST | /counterfactual | What-if replay |
| POST | /arena/create | Multi-buyer arena |
| POST | /arena/join | Join arena |
| POST | /arena/step | Arena step |
| GET | /arena/state | Arena state |
| POST | /highlight | Extract seller tells |
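A minimal client against those endpoints, assuming the Space URL is known. The request and response field names here are guesses, so treat the Swagger docs as the source of truth:

```python
import requests

BASE = "https://<your-space>.hf.space"  # HF Space or local Docker container

# Start an episode, submit one buyer counter-offer, then read the grade.
episode = requests.post(f"{BASE}/reset", json={"task": "single_deal"}).json()
session_id = episode["session_id"]  # field name assumed

step = requests.post(
    f"{BASE}/step",
    json={"session_id": session_id, "action": {"type": "counter", "price": 950}},
).json()

score = requests.get(f"{BASE}/score", params={"session_id": session_id}).json()
print(step, score)
```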
Artifacts on Hugging Face

Everything is durable. Anyone can reproduce.

Adapter

PayMyBills/bestdealbot-v2

Llama-3.1-8B + QLoRA, trained with SFT+GRPO. trainer_state.json and the last checkpoint are live for verification.

Open on HF →
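Loading the adapter for local inference follows the standard PEFT pattern; quantization and generation settings are omitted here, and the adapter layout at the repo root is assumed.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "PayMyBills/bestdealbot-v2")  # attach the QLoRA adapter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
```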
Eval datasets

scaling-eval-runs

Full transcripts of the 3B / 8B / Sauda v2 scaling ladder. n=30 per task.

Open on HF →
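The transcripts load like any Hub dataset; the split name and column schema below are assumptions, so check the dataset card for the real layout.

```python
from datasets import load_dataset

# Split name is assumed; inspect the dataset card for the actual configuration.
runs = load_dataset("PayMyBills/scaling-eval-runs", split="train")
print(runs.column_names, len(runs))
```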
Hackathon journal

The blog with all receipts

Bugs, the four-hour rollout we lost to a bash typo, the ablation that disproved our own hypothesis: all written live.

Read on GitHub →
Training notebooks

One-click reproduce

Colab notebooks for SFT+GRPO and for DPO/RLAIF. T4-friendly, runnable end-to-end.

Open in Colab →