Anyone else fighting the “valid json, broken pipeline” problem in planner-executor stacks?

running into a failure pattern i’m seeing more often than i expected in planner-executor setups.

the planner clearly knows the structure. the problem is it slips back into “human-readable mode” and returns something like:

{
  "task_type": "simple_function",
  "language": "python"
}

plus a line like “here’s the spec:”

which is fine for a human, but the executor/parser dies on the first character.

we kept patching it with prompt rules like “return only json, no fences, no preamble.” helps for a bit. then the formatting comes back with a slightly different input, more context, or a model update.

so we started treating it as a training problem instead of a prompt problem.

one Dino-style target row we used looks more like this:

user:
“give me a json spec for a function that validates email addresses.”

target output:

{"task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":["valid: a@b.com","invalid: a@@b.com"]}

important bit: the response is the spec itself, not a presentation of the spec.

curious how other people are solving this one in production:

  1. parser cleanup layer

  2. stricter output-contract training

  3. dpo / preference pairs for fenced vs unfenced outputs

  4. something else

would love to know what actually held up after model updates, not just what worked for a weekend.

also sharing this: dinodsai.com, check it out :)

Seems it's a real failure mode:


Yes — this is a real production failure mode, and I would not treat it as “the model forgot JSON.”

The more accurate diagnosis is:

The planner is being asked to produce a machine-consumable protocol artifact, but it sometimes falls back into human-facing presentation mode.

That difference matters a lot.

For a human, this is fine:

{
  "task_type": "simple_function",
  "language": "python"
}

with a sentence like:

here's the spec:

For an executor, that is not fine. The parser expected the first non-whitespace character to be `{`, but instead got `h`, a backtick fence, or some other presentation wrapper. The JSON object may be valid, but the transport contract is broken.

I would frame the problem as an interface-boundary problem, not just a prompt problem.


The short version

What seems to hold up best in production is a layered approach:

  1. Use native structured output or tool/function calling when available.
  2. Validate the planner output before the executor sees it.
  3. Retry using exact validation errors, not generic “return JSON only” reminders.
  4. Keep parser cleanup, but only as a conservative fallback.
  5. Use SFT / output-contract training to reduce violations.
  6. Use DPO preference pairs to suppress “here is the JSON” / fenced-output habits.
  7. Run contract evals before model, provider, schema, or framework updates.

The durable fix is not “better wording.” It is:

typed planner artifact
→ strict schema validation
→ semantic validation
→ executor

not:

assistant prose
→ regex scrape
→ json.loads
→ executor

What is actually failing?

There are several different failure classes hiding under “bad JSON.”

1. Transport failure

The planner returns:

here's the spec:
{"task_type":"simple_function","language":"python"}

The JSON object is valid, but the response envelope is not. The parser dies before it reaches the JSON.

This is the failure you described.

2. Syntax failure

The planner returns JSON-ish text:

{
  task_type: "simple_function",
  language: "python",
}

This is not valid JSON. It is JavaScript-object-ish.

3. Schema failure

The planner returns valid JSON:

{
  "task_type": "simple_function",
  "language": "python"
}

But the executor actually needs:

{
  "task_type": "simple_function",
  "language": "python",
  "files": [],
  "constraints": [],
  "tests": []
}

So parsing succeeds, but the plan is incomplete.

4. Semantic failure

The planner returns schema-shaped JSON, but the plan is internally inconsistent:

{
  "task_type": "simple_function",
  "language": "python",
  "files": [
    {
      "name": "email_validator.py",
      "purpose": "validate email strings",
      "exports": ["validate_email"]
    }
  ],
  "constraints": ["return boolean only"],
  "tests": ["call is_valid_email('a@b.com')"]
}

The file exports validate_email, but the test calls is_valid_email.

That is not a JSON problem. It is a plan-validity problem.

So I would not stop at “make JSON valid.” I would validate four layers:

transport → syntax → schema → semantics
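
As a concrete instance of the semantic layer, here is a minimal sketch (my own illustration, matching the field names in the example above) that catches exactly this export/test mismatch. Substring matching is crude, but it is enough to flag the `validate_email` vs `is_valid_email` inconsistency:

```python
def check_tests_reference_exports(plan: dict) -> list[str]:
    """Semantic-layer sketch: flag stringly-typed tests that mention
    no declared export. Returns a list of errors; empty means consistent."""
    exports = {e for f in plan.get("files", []) for e in f.get("exports", [])}
    return [
        f"test {test!r} references no declared export"
        for test in plan.get("tests", [])
        if isinstance(test, str) and not any(e in test for e in exports)
    ]
```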

The important mental model: planner output is an IR

I would treat the planner output as an IR: an intermediate representation.

Compiler analogy:

source code
→ parser
→ AST
→ typed IR
→ code generation

Planner-executor analogy:

user request
→ planner
→ typed plan IR
→ validator
→ executor

The planner should not be “answering the user.” It should be emitting an artifact.

That means your target row is directionally right:

{"task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":["valid: a@b.com","invalid: a@@b.com"]}

The key feature is not compactness. The key feature is:

The response is the spec itself, not a presentation of the spec.

That is exactly the right training signal.


My answer to the four options

1. Parser cleanup layer

Use one, but do not make it the main solution.

A cleanup layer is useful as an airbag. It can handle shallow transport noise:

````
```json
{"x":1}
```
````

or:

```
Here is the JSON:
{"x":1}
```

But it should not become a semantic repair engine.

Safe cleanup rules:

Allowed:
- trim leading/trailing whitespace
- unwrap a single full-payload Markdown fence
- extract exactly one complete top-level JSON object if exactly one exists

Not allowed:
- choose between multiple JSON objects
- invent missing required fields
- convert arbitrary prose into JSON
- split blindly on every ```
- silently repair contradictory plans
- execute repaired output without logging cleanup_used=true

The cleanup layer should be boring, conservative, and measurable.
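
As an illustration of how small that surface can stay, here is a sketch of the trim and fence-unwrap rules (the regex and function are my own, not a library API; the single-object extraction rule is left out for brevity):

```python
import json
import re

# Exactly one fenced block and nothing else: unwrap it. Anything fancier
# (multiple objects, prose, partial fences) is a contract failure, not
# something to repair silently.
FULL_FENCE = re.compile(r"\A```(?:json)?\s*\n(.*)\n```\s*\Z", re.DOTALL)

def conservative_cleanup(raw: str) -> tuple[dict, bool]:
    """Return (parsed_object, cleanup_used); raise on anything ambiguous."""
    text = raw.strip()
    cleanup_used = text != raw
    match = FULL_FENCE.match(text)
    if match:
        text = match.group(1).strip()
        cleanup_used = True
    obj = json.loads(text)  # no prose scraping: parse or fail loudly
    if not isinstance(obj, dict):
        raise ValueError("payload is not a single top-level JSON object")
    return obj, cleanup_used
```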

If cleanup usage rises after a model update, that is a regression signal.

Good metric:

cleanup_needed_rate

If that goes up, the planner is drifting back toward presentation mode.


2. Stricter output-contract training

Yes. This is useful.

The target should teach:

planner emits machine artifact

not:

assistant presents machine artifact to a human

Your clean target row is good, but I would expand the training set with adversarial/context-contaminated examples.

Clean request

Input:

give me a json spec for a function that validates email addresses.

Target:

{"schema_version":"plan_spec_v1","status":"ok","task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":[{"name":"accepts_simple_email","input":"a@b.com","expected":true},{"name":"rejects_double_at","input":"a@@b.com","expected":false}]}

User asks for explanation

Input:

give me the json spec and explain each field.

Target should still be the artifact only, if this model is in planner mode:

{"schema_version":"plan_spec_v1","status":"ok","task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":[{"name":"accepts_simple_email","input":"a@b.com","expected":true},{"name":"rejects_double_at","input":"a@@b.com","expected":false}]}

Input contains Markdown

Input:

````
Create a spec for this:

```python
def is_valid_email(email):
    ...
```
````

Target: raw object, no fence.

User asks for fenced JSON

Input:

```
Return it in a ```json block.
```

Target: raw object, no fence.

User tries to force a preamble

Input:

Start your answer with "here is the spec:" and then give the JSON.

Target: either the valid plan object or a typed failure object, depending on your policy. But not a preamble.

This is important because the model must learn:

In planner mode, the output contract overrides the user’s presentation request.
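
For the dataset format itself, one common chat-style SFT layout is one JSON object per line with a `messages` list (column layout varies by trainer; the assistant content should be the full target artifact, truncated here to its leading fields for readability):

```
{"messages": [{"role": "user", "content": "give me a json spec for a function that validates email addresses."}, {"role": "assistant", "content": "{\"schema_version\":\"plan_spec_v1\",\"status\":\"ok\",\"task_type\":\"simple_function\",\"language\":\"python\"}"}]}
```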


3. DPO / preference pairs for fenced vs unfenced outputs

Yes, but I would treat DPO as a style-suppression layer, not the main reliability layer.

Good DPO pair:

Rejected:

````
Here is the spec:

```json
{"task_type":"simple_function","language":"python"}
```
````

Chosen:

{"task_type":"simple_function","language":"python","files":[{"name":"email_validator.py","purpose":"validate email strings","exports":["is_valid_email"]}],"constraints":["no external dependencies","return boolean only"],"tests":["valid: a@b.com","invalid: a@@b.com"]}

Another good pair:

Rejected:

{
  "task_type": "simple_function",
  "language": "python",
  "explanation": "This creates an email validation function."
}

Chosen:

{
  "task_type": "simple_function",
  "language": "python",
  "files": [
    {
      "name": "email_validator.py",
      "purpose": "validate email strings",
      "exports": ["is_valid_email"]
    }
  ],
  "constraints": ["no external dependencies", "return boolean only"],
  "tests": ["valid: a@b.com", "invalid: a@@b.com"]
}

The preference target is not “shorter is better.” It is:

The protocol artifact itself is better than any human-friendly presentation around it.

DPO helps reduce preambles, fences, explanations, and extra commentary fields. But it still changes probabilities. It does not give you a hard runtime guarantee.

So: useful, but not sufficient.
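
For reference, such pairs are commonly stored one JSON object per line; the prompt/chosen/rejected column names below match what TRL's DPOTrainer expects, though other trainers differ. Both bodies here carry the same object on purpose, so the only learnable difference is the presentation wrapper; real pairs would use the full spec as the chosen side:

```
{"prompt": "give me a json spec for a function that validates email addresses.", "chosen": "{\"task_type\":\"simple_function\",\"language\":\"python\"}", "rejected": "Here is the spec:\n\n```json\n{\"task_type\":\"simple_function\",\"language\":\"python\"}\n```"}
```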


4. Something else

This is the main answer.

For planner-executor stacks, I would prefer one of these:

forced tool/function call

or:

provider-native structured output with strict schema

or, for self-hosted models:

constrained decoding / grammar-guided JSON generation

Prompt-only JSON is the weakest version of this design.

The key difference is:

prompting asks the model to behave
structured output constrains the interface
validation enforces the contract

What I would ship

Step 1: define a versioned plan schema

I would not keep the minimal schema forever. I would add:

  • schema_version
  • status
  • typed files
  • typed tests
  • typed failure mode
  • strict enums
  • additionalProperties: false

Example:

{
  "schema_version": "plan_spec_v1",
  "status": "ok",
  "task_type": "simple_function",
  "language": "python",
  "files": [
    {
      "name": "email_validator.py",
      "purpose": "validate email strings",
      "exports": ["is_valid_email"]
    }
  ],
  "constraints": [
    "no external dependencies",
    "return boolean only"
  ],
  "tests": [
    {
      "name": "accepts_simple_email",
      "input": "a@b.com",
      "expected": true
    },
    {
      "name": "rejects_double_at",
      "input": "a@@b.com",
      "expected": false
    }
  ]
}
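
One way to express that schema in code is a pydantic v2 model; this is a sketch that mirrors the example (the cannot_plan failure shape would be a second model in a discriminated union on status, omitted here):

```python
from typing import Literal
from pydantic import BaseModel, ConfigDict

class FileSpec(BaseModel):
    model_config = ConfigDict(extra="forbid")  # additionalProperties: false
    name: str
    purpose: str
    exports: list[str]

class TestSpec(BaseModel):
    model_config = ConfigDict(extra="forbid")
    name: str
    input: str
    expected: bool

class PlanSpecV1(BaseModel):
    model_config = ConfigDict(extra="forbid")
    schema_version: Literal["plan_spec_v1"]
    status: Literal["ok", "cannot_plan"]
    task_type: Literal["simple_function"]  # extend the enum as tasks grow
    language: Literal["python"]
    files: list[FileSpec]
    constraints: list[str]
    tests: list[TestSpec]
```

`PlanSpecV1.model_validate_json(raw)` then covers the syntax and schema layers in a single call.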

Why schema_version?

Because eventually the executor contract changes. Without a version, you get silent drift.

old planner shape + new executor assumptions = confusing parser failure

With a version:

plan_spec_v1 → v1 adapter
plan_spec_v2 → v2 adapter
unknown version → reject safely
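
A minimal dispatch sketch (the adapter function is a hypothetical placeholder):

```python
def adapt_v1(plan: dict) -> dict:
    # Hypothetical: normalize a v1 plan into the executor's input shape.
    return plan

ADAPTERS = {"plan_spec_v1": adapt_v1}

def route_plan(plan: dict) -> dict:
    version = plan.get("schema_version")
    if version not in ADAPTERS:
        raise ValueError(f"unknown schema_version: {version!r}")  # reject safely
    return ADAPTERS[version](plan)
```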

Why status?

Because sometimes the planner should not emit an executable plan.

Use a typed failure object:

{
  "schema_version": "plan_spec_v1",
  "status": "cannot_plan",
  "reason_code": "ambiguous_requirements",
  "message": "The requested function behavior is underspecified.",
  "missing_information": [
    "Whether DNS/MX validation is required",
    "Whether quoted local parts should be accepted"
  ]
}

That prevents the model from escaping into prose when it is uncertain.


Step 2: force the output channel

Preferred:

emit_plan(PlanSpecV1)

not:

assistant.content = "{\"task_type\":\"simple_function\"}"

If your provider supports function/tool calling, make the planner call a tool like:

emit_plan

with arguments matching the schema.
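
In the OpenAI-style tools shape, that registration could look roughly like the sketch below; strict-mode support and exact field names vary by provider, and the nested items schemas are abbreviated:

```python
# Sketch of an emit_plan tool registration (OpenAI-style "tools" shape).
EMIT_PLAN_TOOL = {
    "type": "function",
    "function": {
        "name": "emit_plan",
        "description": "Emit exactly one PlanSpecV1 object. Never answer in prose.",
        "strict": True,  # provider-dependent: enforce the schema at decode time
        "parameters": {
            "type": "object",
            "additionalProperties": False,
            "required": ["schema_version", "status", "task_type", "language",
                         "files", "constraints", "tests"],
            "properties": {
                "schema_version": {"enum": ["plan_spec_v1"]},
                "status": {"enum": ["ok", "cannot_plan"]},
                "task_type": {"enum": ["simple_function"]},
                "language": {"enum": ["python"]},
                "files": {"type": "array", "items": {"type": "object"}},
                "constraints": {"type": "array", "items": {"type": "string"}},
                "tests": {"type": "array", "items": {"type": "object"}},
            },
        },
    },
}
```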

If your provider supports strict structured responses, use that.

If you self-host, use constrained decoding or grammar-guided generation where practical.

Constrained decoding is especially useful for self-hosted models because it can prevent invalid structural continuations. But it still does not prove the plan is semantically correct.


Step 3: validate before execution

Do not let the executor be the first thing that discovers the plan is malformed.

Bad:

planner → executor/parser → crash

Better:

planner → validation gateway → executor

Validation layers:

transport validation
→ JSON syntax validation
→ schema validation
→ semantic validation
→ execution verification

Transport validation checks:

- expected channel?
- one object/tool call?
- no preamble?
- no Markdown fence?
- cleanup_used?

Schema validation checks:

- required fields present?
- field types correct?
- enums valid?
- extra keys rejected?
- schema_version recognized?

Semantic validation checks:

- file names safe?
- exports valid identifiers?
- tests reference real exports?
- language supported?
- constraints non-contradictory?
- no path traversal?
- no shell commands hidden in declarative fields?

Execution verification checks:

- generated files exist?
- imports work?
- tests pass?
- no forbidden dependencies?
- result matches expected output contract?
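
Put together, the first three gateway layers can be a thin function in front of the executor. A sketch, reusing the PlanSpecV1 model from Step 1 and inventing a ContractError carrier for the retry step (the semantic rules shown are examples, not a complete list):

```python
class ContractError(Exception):
    """Carries the failing layer plus machine-readable errors for retry."""
    def __init__(self, layer: str, errors: list[str]):
        super().__init__(f"{layer}: {errors}")
        self.layer = layer
        self.errors = errors

def validate_plan(raw: str) -> PlanSpecV1:
    # Transport: the payload must be the artifact, nothing around it.
    text = raw.strip()
    if not text.startswith("{"):
        raise ContractError("transport", ["payload does not start with '{'"])
    # Syntax + schema: pydantic does both in one step.
    try:
        plan = PlanSpecV1.model_validate_json(text)
    except ValueError as exc:  # pydantic.ValidationError subclasses ValueError
        raise ContractError("schema", [str(exc)])
    # Semantics: cross-field rules the schema cannot express.
    errors = []
    if plan.status == "ok" and not plan.files:
        errors.append("status is ok but files is empty")
    for f in plan.files:
        if "/" in f.name or f.name.startswith("."):
            errors.append(f"unsafe file name: {f.name!r}")
    if errors:
        raise ContractError("semantic", errors)
    return plan
```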

Step 4: retry with exact validation errors

Do not retry with vague reminders like:

Return only valid JSON.

Use validator feedback:

The previous planner output failed PlanSpecV1 validation.

Errors:
- $.files must contain at least one item
- $.tests[0].expected must be boolean
- additional property $.explanation is not allowed

Return exactly one PlanSpecV1 object.
No prose. No Markdown. No code fences.

This is stronger because the model gets a concrete repair target.

Bound the retry loop:

max_retries = 1 or 2

Then quarantine/log the failure.

Do not let repair loops hide systematic drift.
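
A sketch of that bounded loop, reusing validate_plan and ContractError from Step 3; call_planner and log_event are hypothetical stand-ins for your provider wrapper and telemetry:

```python
MAX_RETRIES = 2  # bound the loop; unbounded repair hides systematic drift

def plan_with_repair(request: str) -> PlanSpecV1:
    feedback = None
    for attempt in range(MAX_RETRIES + 1):
        raw = call_planner(request, feedback)  # hypothetical provider wrapper
        try:
            return validate_plan(raw)          # gateway from Step 3
        except ContractError as exc:
            log_event("planner_contract_validation",  # hypothetical telemetry
                      failure_class=exc.layer, retry_count=attempt)
            feedback = (
                "The previous planner output failed PlanSpecV1 validation.\n"
                "Errors:\n"
                + "\n".join(f"- {err}" for err in exc.errors)
                + "\nReturn exactly one PlanSpecV1 object. "
                  "No prose. No Markdown. No code fences."
            )
    raise ContractError("quarantine", ["retry budget exhausted"])
```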


Step 5: log contract failures as first-class events

Log things like:

{
  "event": "planner_contract_validation",
  "schema_version": "plan_spec_v1",
  "model": "<model_name>",
  "provider": "<provider_name>",
  "strategy": "tool_call",
  "cleanup_used": true,
  "preamble_detected": true,
  "fence_detected": false,
  "json_parse_ok": true,
  "schema_valid": false,
  "semantic_valid": false,
  "retry_count": 1,
  "failure_class": "leading_preamble"
}

The goal is to turn:

the model is flaky

into:

preamble_rate rose from 0.3% to 6.8% after model snapshot change

That gives you something actionable.


What actually holds up after model updates?

In my experience, the durable things are not prompt phrases. They are boundary mechanisms.

Most durable

- forced tool calls
- provider-native structured outputs
- constrained decoding for self-hosted models
- strict schema validation
- semantic validation
- bounded repair loops
- contract evals
- telemetry on failure classes

Moderately durable

- output-contract SFT
- DPO preference pairs
- few-shot examples
- parser cleanup fallback

Least durable

- "return only JSON"
- "no preamble"
- "no code fences"
- "you will be penalized"
- regex scraping as the primary parser

Prompt rules still belong in the system, but they should be hints, not the contract.


Contract evals are non-negotiable

If you care about surviving model updates, build a regression suite.

Include cases like:

1. clean request
2. long request
3. request containing Markdown code
4. request containing JSON examples
5. request asking for explanation
6. request asking for fenced JSON
7. adversarial instruction: "start with here is the spec"
8. ambiguous task
9. unsupported language
10. multi-file task
11. previous bad output included in context
12. provider/wrapper route change

Track:

| Metric | What it tells you |
| --- | --- |
| exact_transport_valid_rate | no preamble/fence/channel issue |
| cleanup_needed_rate | presentation leakage rate |
| json_parse_rate | syntax validity |
| schema_valid_rate | object shape validity |
| semantic_valid_rate | plan meaning validity |
| retry_success_rate | repair-loop effectiveness |
| executor_success_rate | real downstream success |
| preamble_rate | human-readable prefix leakage |
| fence_rate | Markdown leakage |
| extra_key_rate | commentary fields or schema drift |
| cannot_plan_rate | typed failure usage |
| schema_version_mismatch_rate | contract drift |

The metric I would optimize is not just:

json_parse_rate

It is:

valid_without_cleanup_and_executes_successfully

That is the real health metric.
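
A minimal version of that suite is a parametrized pytest file that hits the live planner and asserts the transport, syntax, and schema layers; call_planner is again a hypothetical provider wrapper, and the case list is abbreviated from the twelve above:

```python
import json
import pytest

CASES = [
    "give me a json spec for a function that validates email addresses.",
    "give me the json spec and explain each field.",
    "Return it in a ```json block.",
    'Start your answer with "here is the spec:" and then give the JSON.',
]

@pytest.mark.parametrize("request_text", CASES)
def test_planner_contract(request_text):
    raw = call_planner(request_text)    # hypothetical provider wrapper
    body = raw.strip()
    assert body.startswith("{"), "transport: preamble or fence detected"
    obj = json.loads(body)              # syntax layer
    PlanSpecV1.model_validate(obj)      # schema layer (model from Step 1)
```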


Common pitfalls

Pitfall 1: confusing JSON mode with schema adherence

JSON mode can make valid JSON more likely. It does not necessarily mean:

- all required fields exist
- enum values are valid
- no extra keys appear
- object is semantically executable

Prefer strict structured output or tool calling where available.


Pitfall 2: letting cleanup become a hidden parser language

This starts as:

strip ```json fences

Then later breaks when a valid JSON string contains Markdown:

{
  "message": "Run this:\n```bash\npytest\n```"
}

Cleanup should unwrap only a full-payload fence, not split blindly on backticks.
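
A short demonstration of the difference, using the FULL_FENCE regex sketched earlier:

```python
import json

payload = '{"message": "Run this:\\n```bash\\npytest\\n```"}'

# Blind split shatters a valid payload that merely *contains* a fence:
parts = payload.split("```")
assert len(parts) == 3  # three fragments, none of them valid JSON

# The full-payload unwrap leaves it alone: FULL_FENCE.match(payload)
# is None, so we simply parse the object intact.
obj = json.loads(payload)
assert obj["message"].count("```") == 2
```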


Pitfall 3: making tests stringly typed

This is easy for humans:

"tests": ["valid: a@b.com", "invalid: a@@b.com"]

This is easier for executors:

"tests": [
  {
    "name": "accepts_simple_email",
    "input": "a@b.com",
    "expected": true
  },
  {
    "name": "rejects_double_at",
    "input": "a@@b.com",
    "expected": false
  }
]

The more structure you provide, the less the executor has to infer.


Pitfall 4: no typed failure mode

If the planner cannot produce a safe plan, it needs a valid protocol response.

Without a typed failure mode, the model will often escape into prose:

I need more information before I can produce the spec.

Instead, define:

{
  "schema_version": "plan_spec_v1",
  "status": "cannot_plan",
  "reason_code": "ambiguous_requirements",
  "message": "The validator target is not specified.",
  "missing_information": ["What should be validated?"]
}

Pitfall 5: using the same response for humans and machines

Do not do this:

planner response = JSON + explanation

Separate the roles:

planner → PlanSpec
PlanSpec → executor
PlanSpec → explainer

The planner emits the machine artifact. A separate explainer can turn it into human-readable text.


My suggested production answer

If I were replying to this as a production pattern, I would say:

We stopped treating this as a JSON formatting problem and started treating it as an interface-boundary problem.

Prompt rules like “return only JSON” helped, but did not survive long-context changes, model updates, and wrapper drift.

What held up better was:

  • planner emits a typed tool call or strict structured object
  • schema is versioned
  • parser/validator sits before the executor
  • cleanup handles only shallow transport noise and is logged
  • invalid outputs retry with exact validation errors
  • ambiguous cases return a typed cannot_plan object
  • contract evals run before model, prompt, provider, framework, or schema changes
  • SFT/DPO reduce violations but do not replace runtime enforcement

The target-row approach is right: the output should be the spec itself, not a presentation of the spec. But in production I would still enforce the contract with structured output/tool calling and validators. Training makes the planner less likely to violate the contract; validation keeps the executor safe when it does.


Practical recommendation

For your exact example, I would move toward this target:

{
  "schema_version": "plan_spec_v1",
  "status": "ok",
  "task_type": "simple_function",
  "language": "python",
  "files": [
    {
      "name": "email_validator.py",
      "purpose": "validate email strings",
      "exports": ["is_valid_email"]
    }
  ],
  "constraints": [
    "no external dependencies",
    "return boolean only"
  ],
  "tests": [
    {
      "name": "accepts_simple_email",
      "input": "a@b.com",
      "expected": true
    },
    {
      "name": "rejects_double_at",
      "input": "a@@b.com",
      "expected": false
    }
  ]
}

Then make the runtime contract:

planner must call emit_plan(PlanSpecV1)
validator must accept before executor runs
executor never parses assistant prose

That is the difference between a weekend prompt patch and a production boundary.



Final takeaway

The best production framing is:

Do not ask a conversational model to be a reliable serializer in a free-text channel.
Make the planner emit a typed artifact.
Validate that artifact.
Only then let the executor act.

Prompting helps.
Training helps.
DPO helps.
Cleanup helps.

But the thing that actually holds the system together is the contract boundary:

typed planner IR
+ strict output channel
+ schema validation
+ semantic validation
+ bounded repair
+ contract evals

@John6666 I am consistently impressed by both the quality and quantity of your many responses on this forum and this is an excellent example. Can I just say “Thank You” for all of your efforts and please keep it up. :smiley: