This is a great dataset. The 'cross-domain causality leap' is something we see constantly in brand monitoring—e.g. an LLM seeing a pricing page for 'Product A' and a feature list for 'Product B' and confidently asserting 'Product A has Feature B for $X'.
One edge case you might want to add: *Temporal Merging*. We often see models take a '2024 Roadmap' and a '2023 Release Note' and hallucinate that the roadmap features were released in 2023. It's valid syntax, valid entities, but impossible chronology.
Are you planning to expand this to RAG-specific failures (where the context retrieval causes the mix-up) or focusing purely on model-internal logic gaps?
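To make the 'impossible chronology' idea concrete, here's a toy sketch of a temporal-consistency check — all names, and the simplified feature/year model, are my own invention, not from the dataset:

```python
from dataclasses import dataclass

@dataclass
class SourceDoc:
    doc_id: str
    year: int            # year the document describes
    kind: str            # e.g. "roadmap" (planned) vs "release_note" (shipped)
    features: set[str]

def find_temporal_merges(claim_features: dict[str, int],
                         sources: list[SourceDoc]) -> list[str]:
    """Flag features the claim dates earlier than any source allows.

    A feature that only appears in a 2024 roadmap cannot have shipped
    in 2023, so a claimed release year before its earliest source year
    is chronologically impossible.
    """
    earliest: dict[str, int] = {}
    for doc in sources:
        for feat in doc.features:
            earliest[feat] = min(earliest.get(feat, doc.year), doc.year)
    return [
        feat for feat, claimed_year in claim_features.items()
        if feat in earliest and claimed_year < earliest[feat]
    ]

# The 2024-roadmap / 2023-release-note merge described above:
sources = [
    SourceDoc("roadmap-2024", 2024, "roadmap", {"sso", "audit-log"}),
    SourceDoc("release-2023", 2023, "release_note", {"dark-mode"}),
]
claim = {"sso": 2023, "dark-mode": 2023}   # model asserts SSO shipped in 2023
print(find_temporal_merges(claim, sources))  # → ['sso']
```

The point is that every individual entity in the claim is real; only the cross-document date assignment is impossible, which is exactly what makes this class hard to catch with entity-level fact checks.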
That's a great example — the "Product A + Product B pricing merge" is exactly the kind of structurally valid but impossible composition I was trying to isolate.

I really like the "Temporal Merging" framing. You're right: roadmap + release notes = syntactically consistent, entity-valid, but chronologically impossible. I haven't explicitly modeled temporal integrity yet, but that seems like a natural extension of the cross-domain tests.

Regarding RAG: So far the focus has been on model-internal structural logic gaps. I haven't built retrieval-aware tests yet. That said, I suspect many RAG failures are just amplified cross-document merging errors, so a temporal integrity layer might actually generalize well there.

If you have examples from brand monitoring contexts, I'd love to add them as new regression cases.
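For concreteness, a temporal-merging regression case might look something like this — the field names and shape are purely my own sketch, not an existing schema:

```python
import re

# Hypothetical regression-case shape for a temporal-merging failure:
case = {
    "id": "temporal-merge-001",
    "sources": [
        {"doc": "2024 Roadmap", "text": "SSO is planned for Q2 2024."},
        {"doc": "2023 Release Notes", "text": "Dark mode shipped in March 2023."},
    ],
    "prompt": "When was SSO released?",
    "failure_pattern": "asserts SSO was released in 2023",
    "expected_behavior": "states SSO is planned for 2024, not yet released",
}

def source_years(case: dict) -> list[int]:
    """Pull the four-digit years out of the source doc titles, so a
    harness can sanity-check claimed dates against them."""
    return [int(m.group()) for s in case["sources"]
            if (m := re.search(r"\b(19|20)\d{2}\b", s["doc"]))]

print(source_years(case))  # → [2024, 2023]
```

Brand monitoring examples could drop straight into this shape: one entry per observed merge, with the real source snippets in `sources`.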