Is this a real failure mode?
Yes — this is a real production failure mode, and I would not treat it as “the model forgot JSON.”
The more accurate diagnosis is:
The planner is being asked to produce a machine-consumable protocol artifact, but it sometimes falls back into human-facing presentation mode.
That difference matters a lot.
For a human, this is fine:
{
"task_type": "simple_function",
"language": "python"
}
with a sentence like:
here's the spec:
For an executor, that is not fine. The parser expected the first non-whitespace character to be {, but instead got h, ```, or some other presentation wrapper. The JSON object may be valid, but the transport contract is broken.
I would frame the problem as an interface-boundary problem, not just a prompt problem.
The short version
What seems to hold up best in production is a layered approach:
- Use native structured output or tool/function calling when available.
- Validate the planner output before the executor sees it.
- Retry using exact validation errors, not generic “return JSON only” reminders.
- Keep parser cleanup, but only as a conservative fallback.
- Use SFT / output-contract training to reduce violations.
- Use DPO preference pairs to suppress “here is the JSON” / fenced-output habits.
- Run contract evals before model, provider, schema, or framework updates.
The durable fix is not “better wording.” It is:
typed planner artifact
→ strict schema validation
→ semantic validation
→ executor
not:
assistant prose
→ regex scrape
→ json.loads
→ executor
What is actually failing?
There are several different failure classes hiding under “bad JSON.”
1. Transport failure
The planner returns:
here's the spec:
{"task_type":"simple_function","language":"python"}
The JSON object is valid, but the response envelope is not. The parser dies before it reaches the JSON.
This is the failure you described.
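A minimal repro of the transport failure, assuming the executor hands the raw response straight to `json.loads`:

```python
import json

raw = 'here\'s the spec:\n{"task_type":"simple_function","language":"python"}'

try:
    plan = json.loads(raw)
except json.JSONDecodeError as err:
    # The embedded object is valid, but the parser dies on the preamble:
    # the first non-whitespace character is 'h', not '{'.
    print(f"transport failure at position {err.pos}")
```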
2. Syntax failure
The planner returns JSON-ish text:
{
task_type: "simple_function",
language: "python",
}
This is not valid JSON. It is JavaScript-object-ish.
3. Schema failure
The planner returns valid JSON:
{
"task_type": "simple_function",
"language": "python"
}
But the executor actually needs:
{
"task_type": "simple_function",
"language": "python",
"files": [],
"constraints": [],
"tests": []
}
So parsing succeeds, but the plan is incomplete.
4. Semantic failure
The planner returns schema-shaped JSON, but the plan is internally inconsistent:
{
"task_type": "simple_function",
"language": "python",
"files": [
{
"name": "email_validator.py",
"purpose": "validate email strings",
"exports": ["validate_email"]
}
],
"constraints": ["return boolean only"],
"tests": ["call is_valid_email('a@b.com')"]
}
The file exports validate_email, but the test calls is_valid_email.
That is not a JSON problem. It is a plan-validity problem.
So I would not stop at “make JSON valid.” I would validate four layers:
transport → syntax → schema → semantics
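The semantic layer can catch the export/test mismatch above with a simple consistency check. A sketch, assuming stringly typed tests where the called function name appears before a parenthesis:

```python
import re

def undefined_calls(plan: dict) -> list[str]:
    """Return function names called in tests that no file exports."""
    exports = {name for f in plan["files"] for name in f["exports"]}
    calls = []
    for test in plan["tests"]:
        # Identifier immediately followed by an opening parenthesis.
        calls += re.findall(r"[A-Za-z_]\w*(?=\s*\()", test)
    return [c for c in calls if c not in exports]

plan = {
    "files": [{"exports": ["validate_email"]}],
    "tests": ["call is_valid_email('a@b.com')"],
}
undefined_calls(plan)  # → ["is_valid_email"]
```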
The important mental model: planner output is an IR
I would treat the planner output as an IR: an intermediate representation.
Compiler analogy:
source code
→ parser
→ AST
→ typed IR
→ code generation
Planner-executor analogy:
user request
→ planner
→ typed plan IR
→ validator
→ executor
The planner should not be “answering the user.” It should be emitting an artifact.
That means your target row is directionally right:
{"task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":["valid: a@b.com","invalid: a@@b.com"]}
The key feature is not compactness. The key feature is:
The response is the spec itself, not a presentation of the spec.
That is exactly the right training signal.
My answer to the four options
1. Parser cleanup layer
Use one, but do not make it the main solution.
A cleanup layer is useful as an airbag. It can handle shallow transport noise such as a fenced payload:

```json
{"x":1}
```

or a short preamble:

Here is the JSON:
{"x":1}
But it should not become a semantic repair engine.
Safe cleanup rules:
Allowed:
- trim leading/trailing whitespace
- unwrap a single full-payload Markdown fence
- extract exactly one complete top-level JSON object if exactly one exists
Not allowed:
- choose between multiple JSON objects
- invent missing required fields
- convert arbitrary prose into JSON
- split blindly on every ```
- silently repair contradictory plans
- execute repaired output without logging cleanup_used=true
The cleanup layer should be boring, conservative, and measurable.
If cleanup usage rises after a model update, that is a regression signal.
Good metric:
cleanup_needed_rate
If that goes up, the planner is drifting back toward presentation mode.
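A sketch of a cleanup layer that implements only the allowed rules above and rejects everything else:

```python
import json
import re

# Unwrap only a fence that wraps the ENTIRE payload (allowed rule),
# never split on every ``` (not-allowed rule).
_FULL_FENCE = re.compile(r"\A```(?:json)?\s*\n(.*?)\n```\s*\Z", re.DOTALL)

def conservative_cleanup(raw: str) -> tuple[dict, bool]:
    """Return (parsed_object, cleanup_used); raise ValueError on anything
    that is not exactly one top-level JSON object."""
    text = raw.strip()
    cleanup_used = text != raw

    fence = _FULL_FENCE.match(text)
    if fence:
        text = fence.group(1).strip()
        cleanup_used = True

    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in payload")
    if start != 0:
        cleanup_used = True  # there was a leading preamble

    obj, end = json.JSONDecoder().raw_decode(text, start)
    if text[end:].strip():
        # Multiple objects or trailing prose: do not choose, reject.
        raise ValueError("content after the first JSON object")
    return obj, cleanup_used
```

The `cleanup_used` flag is what feeds `cleanup_needed_rate`: execution can proceed, but the event is always logged.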
2. Stricter output-contract training
Yes. This is useful.
The target should teach:
planner emits machine artifact
not:
assistant presents machine artifact to a human
Your clean target row is good, but I would expand the training set with adversarial/context-contaminated examples.
Clean request
Input:
give me a json spec for a function that validates email addresses.
Target:
{"schema_version":"plan_spec_v1","status":"ok","task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":[{"name":"accepts_simple_email","input":"a@b.com","expected":true},{"name":"rejects_double_at","input":"a@@b.com","expected":false}]}
User asks for explanation
Input:
give me the json spec and explain each field.
Target should still be the artifact only, if this model is in planner mode:
{"schema_version":"plan_spec_v1","status":"ok","task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":[{"name":"accepts_simple_email","input":"a@b.com","expected":true},{"name":"rejects_double_at","input":"a@@b.com","expected":false}]}
Input contains Markdown
Input:
Create a spec for this:
```python
def is_valid_email(email):
    ...
```

Target: raw object, no fence.
User asks for fenced JSON
Input:
Return it in a ```json block.
Target: raw object, no fence.
User tries to force a preamble
Input:
Start your answer with "here is the spec:" and then give the JSON.
Target: either the valid plan object or a typed failure object, depending on your policy. But not a preamble.
This is important because the model must learn:
In planner mode, the output contract overrides the user’s presentation request.
3. DPO / preference pairs for fenced vs unfenced outputs
Yes, but I would treat DPO as a style-suppression layer, not the main reliability layer.
Good DPO pair:
Rejected:

Here is the spec:

```json
{"task_type":"simple_function","language":"python"}
```

Chosen:

{"task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":["valid: a@b.com","invalid: a@@b.com"]}
Another good pair:
Rejected:
{
"task_type": "simple_function",
"language": "python",
"explanation": "This creates an email validation function."
}
Chosen:
{
"task_type": "simple_function",
"language": "python",
"files": [
{
"name": "email_validator.py",
"purpose": "validate email strings",
"exports": ["is_valid_email"]
}
],
"constraints": ["no external dependencies", "return boolean only"],
"tests": ["valid: a@b.com", "invalid: a@@b.com"]
}
The preference target is not “shorter is better.” It is:
The protocol artifact itself is better than any human-friendly presentation around it.
DPO helps reduce preambles, fences, explanations, and extra commentary fields. But it still changes probabilities. It does not give you a hard runtime guarantee.
So: useful, but not sufficient.
4. Something else
This is the main answer.
For planner-executor stacks, I would prefer one of these:
forced tool/function call
or:
provider-native structured output with strict schema
or, for self-hosted models:
constrained decoding / grammar-guided JSON generation
Prompt-only JSON is the weakest version of this design.
The key difference is:
prompting asks the model to behave
structured output constrains the interface
validation enforces the contract
What I would ship
Step 1: define a versioned plan schema
I would not keep the minimal schema forever. I would add:
- schema_version
- status
- typed files
- typed tests
- typed failure mode
- strict enums
- additionalProperties: false
Example:
{
"schema_version": "plan_spec_v1",
"status": "ok",
"task_type": "simple_function",
"language": "python",
"files": [
{
"name": "email_validator.py",
"purpose": "validate email strings",
"exports": ["is_valid_email"]
}
],
"constraints": [
"no external dependencies",
"return boolean only"
],
"tests": [
{
"name": "accepts_simple_email",
"input": "a@b.com",
"expected": true
},
{
"name": "rejects_double_at",
"input": "a@@b.com",
"expected": false
}
]
}
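In production I would express this as a JSON Schema with additionalProperties: false and run a standard validator against it. The hand-rolled sketch below shows the same checks with only the standard library, mirroring the field set of the example above:

```python
ALLOWED_KEYS = {"schema_version", "status", "task_type", "language",
                "files", "constraints", "tests"}

def validate_plan_v1(plan: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the shape is valid."""
    # Strict key set: commentary fields like $.explanation are rejected.
    errors = [f"additional property $.{key} is not allowed"
              for key in plan.keys() - ALLOWED_KEYS]
    if plan.get("schema_version") != "plan_spec_v1":
        errors.append("$.schema_version must be 'plan_spec_v1'")
    if plan.get("status") not in {"ok", "cannot_plan"}:
        errors.append("$.status must be one of: ok, cannot_plan")
    for field in ("task_type", "language"):
        if not isinstance(plan.get(field), str):
            errors.append(f"$.{field} must be a string")
    for field in ("files", "constraints", "tests"):
        if not isinstance(plan.get(field), list):
            errors.append(f"$.{field} must be an array")
    return errors
```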
Why schema_version?
Because eventually the executor contract changes. Without a version, you get silent drift.
old planner shape + new executor assumptions = confusing parser failure
With a version:
plan_spec_v1 → v1 adapter
plan_spec_v2 → v2 adapter
unknown version → reject safely
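The routing above is a small dispatch table. A sketch, where the adapter names are hypothetical and v1 is assumed to be the executor's native shape:

```python
def route_plan(plan: dict) -> dict:
    adapters = {
        # v1 is assumed to be the executor's native shape: identity adapter.
        "plan_spec_v1": lambda p: p,
        # "plan_spec_v2": adapt_v2_to_executor,  # added when the contract changes
    }
    version = plan.get("schema_version")
    if version not in adapters:
        # Unknown or missing version: reject safely instead of guessing.
        raise ValueError(f"unrecognized schema_version: {version!r}")
    return adapters[version](plan)
```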
Why status?
Because sometimes the planner should not emit an executable plan.
Use a typed failure object:
{
"schema_version": "plan_spec_v1",
"status": "cannot_plan",
"reason_code": "ambiguous_requirements",
"message": "The requested function behavior is underspecified.",
"missing_information": [
"Whether DNS/MX validation is required",
"Whether quoted local parts should be accepted"
]
}
That prevents the model from escaping into prose when it is uncertain.
Step 2: force the output channel
Preferred:
emit_plan(PlanSpecV1)
not:
assistant.content = "{\"task_type\":\"simple_function\"}"
If your provider supports function/tool calling, make the planner call a tool like:
emit_plan
with arguments matching the schema.
If your provider supports strict structured responses, use that.
If you self-host, use constrained decoding or grammar-guided generation where practical.
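As a concrete illustration, this is roughly what the forced channel looks like with an OpenAI-style tool definition. The exact request shape varies by provider, so treat the structure below as an assumption, not a fixed API:

```python
# Hypothetical OpenAI-style tool definition for the planner.
EMIT_PLAN_TOOL = {
    "type": "function",
    "function": {
        "name": "emit_plan",
        "description": "Emit a PlanSpecV1 artifact for the executor.",
        "parameters": {
            "type": "object",
            "additionalProperties": False,
            "required": ["schema_version", "status", "task_type",
                         "language", "files", "constraints", "tests"],
            "properties": {
                "schema_version": {"type": "string", "enum": ["plan_spec_v1"]},
                "status": {"type": "string", "enum": ["ok", "cannot_plan"]},
                "task_type": {"type": "string"},
                "language": {"type": "string"},
                "files": {"type": "array"},
                "constraints": {"type": "array"},
                "tests": {"type": "array"},
            },
        },
    },
}

# Forcing tool_choice to emit_plan removes the free-text channel entirely:
# the model cannot reply with prose, only with arguments to this schema.
FORCED_CHOICE = {"type": "function", "function": {"name": "emit_plan"}}
```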
Constrained decoding is especially useful for self-hosted models because it can prevent invalid structural continuations. But it still does not prove the plan is semantically correct.
Step 3: validate before execution
Do not let the executor be the first thing that discovers the plan is malformed.
Bad:
planner → executor/parser → crash
Better:
planner → validation gateway → executor
Validation layers:
transport validation
→ JSON syntax validation
→ schema validation
→ semantic validation
→ execution verification
Transport validation checks:
- expected channel?
- one object/tool call?
- no preamble?
- no Markdown fence?
- cleanup_used?
Schema validation checks:
- required fields present?
- field types correct?
- enums valid?
- extra keys rejected?
- schema_version recognized?
Semantic validation checks:
- file names safe?
- exports valid identifiers?
- tests reference real exports?
- language supported?
- constraints non-contradictory?
- no path traversal?
- no shell commands hidden in declarative fields?
Execution verification checks:
- generated files exist?
- imports work?
- tests pass?
- no forbidden dependencies?
- result matches expected output contract?
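Chained together, the gateway looks roughly like this. A sketch: each layer here is a minimal stand-in for the fuller checklist above, and every failure raises with a concrete, repair-ready error:

```python
import json

def validation_gateway(raw: str) -> dict:
    """Run the layers in order; raise at the first failure."""
    # Transport: the whole payload must be the object, nothing else.
    text = raw.strip()
    if not text.startswith("{"):
        raise ValueError("transport: first non-whitespace character is not '{'")
    # Syntax
    plan = json.loads(text)
    # Schema (minimal stand-in for the full strict check)
    missing = {"task_type", "language", "files", "constraints", "tests"} - plan.keys()
    if missing:
        raise ValueError(f"schema: missing required fields: {sorted(missing)}")
    # Semantics (minimal stand-in: declared files imply at least one test)
    if plan["files"] and not plan["tests"]:
        raise ValueError("semantic: files declared but no tests")
    return plan
```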
Step 4: retry with exact validation errors
Do not retry with vague reminders like:
Return only valid JSON.
Use validator feedback:
The previous planner output failed PlanSpecV1 validation.
Errors:
- $.files must contain at least one item
- $.tests[0].expected must be boolean
- additional property $.explanation is not allowed
Return exactly one PlanSpecV1 object.
No prose. No Markdown. No code fences.
This is stronger because the model gets a concrete repair target.
Bound the retry loop:
max_retries = 1 or 2
Then quarantine/log the failure.
Do not let repair loops hide systematic drift.
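A sketch of the bounded repair loop. `call_planner` and `validate` are hypothetical injected callables: `call_planner(prompt)` returns the raw string, `validate(raw)` returns a list of error strings:

```python
import json

MAX_RETRIES = 2

def plan_with_repair(request: str, call_planner, validate) -> dict:
    prompt = request
    for _attempt in range(1 + MAX_RETRIES):
        raw = call_planner(prompt)
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # Retry with the exact validation errors, not a vague reminder.
        prompt = (
            "The previous planner output failed PlanSpecV1 validation.\n"
            "Errors:\n"
            + "\n".join(f"- {e}" for e in errors)
            + "\nReturn exactly one PlanSpecV1 object. No prose. No Markdown."
        )
    # Quarantine instead of looping forever: repair loops can hide drift.
    raise RuntimeError(f"planner violated contract after {MAX_RETRIES} retries")
```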
Step 5: log contract failures as first-class events
Log things like:
{
"event": "planner_contract_validation",
"schema_version": "plan_spec_v1",
"model": "<model_name>",
"provider": "<provider_name>",
"strategy": "tool_call",
"cleanup_used": true,
"preamble_detected": true,
"fence_detected": false,
"json_parse_ok": true,
"schema_valid": false,
"semantic_valid": false,
"retry_count": 1,
"failure_class": "leading_preamble"
}
The goal is to turn:
the model is flaky
into:
preamble_rate rose from 0.3% to 6.8% after model snapshot change
That gives you something actionable.
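Aggregating those events into rates is a one-liner per flag. A sketch over the event shape logged above:

```python
def contract_rates(events: list[dict]) -> dict[str, float]:
    """Turn logged planner_contract_validation events into per-flag rates."""
    n = len(events)
    flags = ("preamble_detected", "fence_detected", "cleanup_used",
             "json_parse_ok", "schema_valid", "semantic_valid")
    return {flag: sum(bool(e.get(flag)) for e in events) / n for flag in flags}
```

Comparing these rates across model snapshots is what turns "the model is flaky" into a concrete regression report.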
What actually holds up after model updates?
In my experience, the durable things are not prompt phrases. They are boundary mechanisms.
Most durable
- forced tool calls
- provider-native structured outputs
- constrained decoding for self-hosted models
- strict schema validation
- semantic validation
- bounded repair loops
- contract evals
- telemetry on failure classes
Moderately durable
- output-contract SFT
- DPO preference pairs
- few-shot examples
- parser cleanup fallback
Least durable
- "return only JSON"
- "no preamble"
- "no code fences"
- "you will be penalized"
- regex scraping as the primary parser
Prompt rules still belong in the system, but they should be hints, not the contract.
Contract evals are non-negotiable
If you care about surviving model updates, build a regression suite.
Include cases like:
1. clean request
2. long request
3. request containing Markdown code
4. request containing JSON examples
5. request asking for explanation
6. request asking for fenced JSON
7. adversarial instruction: "start with here is the spec"
8. ambiguous task
9. unsupported language
10. multi-file task
11. previous bad output included in context
12. provider/wrapper route change
Track:
| Metric | What it tells you |
| --- | --- |
| exact_transport_valid_rate | no preamble/fence/channel issue |
| cleanup_needed_rate | presentation leakage rate |
| json_parse_rate | syntax validity |
| schema_valid_rate | object shape validity |
| semantic_valid_rate | plan meaning validity |
| retry_success_rate | repair-loop effectiveness |
| executor_success_rate | real downstream success |
| preamble_rate | human-readable prefix leakage |
| fence_rate | Markdown leakage |
| extra_key_rate | commentary fields or schema drift |
| cannot_plan_rate | typed failure usage |
| schema_version_mismatch_rate | contract drift |
The metric I would optimize is not just:
json_parse_rate
It is:
valid_without_cleanup_and_executes_successfully
That is the real health metric.
Common pitfalls
Pitfall 1: confusing JSON mode with schema adherence
JSON mode can make valid JSON more likely. It does not necessarily mean:
- all required fields exist
- enum values are valid
- no extra keys appear
- object is semantically executable
Prefer strict structured output or tool calling where available.
Pitfall 2: letting cleanup become a hidden parser language
This starts as:
strip ```json fences
Then later breaks when a valid JSON string contains Markdown:
{
"message": "Run this:\n```bash\npytest\n```"
}
Cleanup should unwrap only a full-payload fence, not split blindly on backticks.
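The pitfall is easy to demonstrate:

```python
import json

# Valid JSON whose string value merely *contains* a Markdown fence.
raw = '{"message": "Run this:\\n```bash\\npytest\\n```"}'

# Blind splitting on backticks shatters a perfectly valid payload:
# none of the shards is the original object any more.
shards = raw.split("```")

# Full-payload unwrapping leaves it alone: the text does not *start*
# with a fence, so no cleanup applies and parsing succeeds as-is.
message = json.loads(raw)["message"]
```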
Pitfall 3: making tests stringly typed
This is easy for humans:
"tests": ["valid: a@b.com", "invalid: a@@b.com"]
This is easier for executors:
"tests": [
{
"name": "accepts_simple_email",
"input": "a@b.com",
"expected": true
},
{
"name": "rejects_double_at",
"input": "a@@b.com",
"expected": false
}
]
The more structure you provide, the less the executor has to infer.
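With typed tests, the executor's test runner becomes purely mechanical. A sketch; the `is_valid_email` implementation here is a toy stand-in, not the real generated code:

```python
import re

def is_valid_email(email: str) -> bool:
    # Toy implementation, only for the sketch.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None

def run_typed_tests(func, tests: list[dict]) -> dict[str, bool]:
    """Typed tests execute directly; nothing has to be inferred from prose."""
    return {t["name"]: func(t["input"]) == t["expected"] for t in tests}

results = run_typed_tests(is_valid_email, [
    {"name": "accepts_simple_email", "input": "a@b.com", "expected": True},
    {"name": "rejects_double_at", "input": "a@@b.com", "expected": False},
])
# → {"accepts_simple_email": True, "rejects_double_at": True}
```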
Pitfall 4: no typed failure mode
If the planner cannot produce a safe plan, it needs a valid protocol response.
Without a typed failure mode, the model will often escape into prose:
I need more information before I can produce the spec.
Instead, define:
{
"schema_version": "plan_spec_v1",
"status": "cannot_plan",
"reason_code": "ambiguous_requirements",
"message": "The validator target is not specified.",
"missing_information": ["What should be validated?"]
}
Pitfall 5: using the same response for humans and machines
Do not do this:
planner response = JSON + explanation
Separate the roles:
planner → PlanSpec
PlanSpec → executor
PlanSpec → explainer
The planner emits the machine artifact. A separate explainer can turn it into human-readable text.
My suggested production answer
If I were replying to this as a production pattern, I would say:
We stopped treating this as a JSON formatting problem and started treating it as an interface-boundary problem.
Prompt rules like “return only JSON” helped, but did not survive long-context changes, model updates, and wrapper drift.
What held up better was:
- planner emits a typed tool call or strict structured object
- schema is versioned
- parser/validator sits before the executor
- cleanup handles only shallow transport noise and is logged
- invalid outputs retry with exact validation errors
- ambiguous cases return a typed cannot_plan object
- contract evals run before model, prompt, provider, framework, or schema changes
- SFT/DPO reduce violations but do not replace runtime enforcement
The target-row approach is right: the output should be the spec itself, not a presentation of the spec. But in production I would still enforce the contract with structured output/tool calling and validators. Training makes the planner less likely to violate the contract; validation keeps the executor safe when it does.
Practical recommendation
For your exact example, I would move toward this target:
{
"schema_version": "plan_spec_v1",
"status": "ok",
"task_type": "simple_function",
"language": "python",
"files": [
{
"name": "email_validator.py",
"purpose": "validate email strings",
"exports": ["is_valid_email"]
}
],
"constraints": [
"no external dependencies",
"return boolean only"
],
"tests": [
{
"name": "accepts_simple_email",
"input": "a@b.com",
"expected": true
},
{
"name": "rejects_double_at",
"input": "a@@b.com",
"expected": false
}
]
}
Then make the runtime contract:
planner must call emit_plan(PlanSpecV1)
validator must accept before executor runs
executor never parses assistant prose
That is the difference between a weekend prompt patch and a production boundary.
Final takeaway
The best production framing is:
Do not ask a conversational model to be a reliable serializer in a free-text channel.
Make the planner emit a typed artifact.
Validate that artifact.
Only then let the executor act.
Prompting helps.
Training helps.
DPO helps.
Cleanup helps.
But the thing that actually holds the system together is the contract boundary:
typed planner IR
+ strict output channel
+ schema validation
+ semantic validation
+ bounded repair
+ contract evals