Note that they don't actually suggest that the XML needs to be VALID!
My guess was that JSON requires more characters to be escaped than XML-ish syntax does, and that matching opening and closing tags makes it a little easier for the LLM not to lose track of which string corresponds to which key.
<instructions>
...
...
</instructions>
can be much easier than
{
"instructions": "..\n...\n"
}
especially when there are newlines, quotes, and Unicode characters involved.
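To make the escaping point concrete, here's a quick Python sketch (the prompt text is made up) of what the model would have to emit in each format:

  import json

  text = 'Say "hello",\nthen stop. Café'

  # JSON: quotes and newlines must be escaped, and json.dumps even
  # turns non-ASCII into \uXXXX escapes by default
  print(json.dumps({"instructions": text}))
  # {"instructions": "Say \"hello\",\nthen stop. Caf\u00e9"}

  # XML-ish: the text passes through between the tags untouched
  print(f"<instructions>\n{text}\n</instructions>")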
I would suspect that a single attention layer can't figure out which token an opening-bracket token should attend to most. Think of {"x": {"y": 1}}: with only one layer of attention, can the token for the first opening bracket attend to exactly its matching closing bracket?
I wonder whether RNNs work better with JSON or XML. Or maybe they're fine with both, because an RNN can keep a stack-like internal state that matches brackets?
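For intuition: bracket matching is a single left-to-right pass with a stack, which is exactly the kind of state a recurrent model could in principle carry. A toy Python sketch of that computation itself (not a claim about what any model actually learns):

  def match_brackets(s):
      # Map each opening-brace index to its matching closing index.
      # Assumes the input is balanced.
      stack, pairs = [], {}
      for i, ch in enumerate(s):
          if ch == "{":
              stack.append(i)
          elif ch == "}":
              pairs[stack.pop()] = i
      return pairs

  print(match_brackets('{"x": {"y": 1}}'))
  # {6: 13, 0: 14}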
It would be a really cool research direction to measure how well Transformer-Mamba hybrid models like Jamba handle structured input/output formats like JSON and XML, and to compare the two. From the LLM era, I could only find papers that run this evaluation on transformer-based LLMs. Damn, I'd love to work at a place that does this kind of research, but I guess I'm stuck with my current boring job for now :D Born to do cutting-edge research, forced to write CRUD apps with some "AI sprinkled in". Anyone hiring here?