Looking at the matrix, all green, yellow, light blue and dark blue are OK outcomes. Red cells are crashes (a stack overflow on 10000 nested arrays, for example), and dark brown cells are valid JSON that didn't parse (things like UTF-8 mishandling). Neither class of issue is really JSON-specific.
JSON parsers may also permit various additional constructs which are very explicitly not JSON. This means that they may accept a handcrafted, technically invalid JSON document. However, a JSON encoder may never generate a document containing such "extensions", as that would not be JSON.
This concept of having parsers accept more than necessary follows best practices of robustness. As the robustness principle goes: "Be conservative in what you do, be liberal in what you accept from others".
"The Harmful Consequences of the Robustness Principle" https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...
The principle is even more harmful because it sounds so logical. If many JSON parsers accept your object even though it is not valid JSON, any new parser that doesn't accept it will be booed as a faulty parser.
If your input is not according to spec, throw an error. The sender is wrong. They deserve to know it and need to fix their output.
In the large colored matrix, the following colors mean everything is fine: green, yellow, light blue and deep blue.
Red are crashes (things like 10000 nested arrays causing a stack overflow, which is a non-JSON-specific parser bug), and dark brown are constructs that should have been supported but weren't (things like UTF-8 handling, again a non-JSON-specific parser bug).
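To make the stack-overflow case concrete, here's a minimal Python sketch (assuming CPython's standard json module) of the kind of input that does it; CPython happens to guard its recursion and raise an error, while a recursive-descent parser without such a guard simply blows its stack:

    import json

    # A pathological document: 100,000 nested arrays. Any parser that
    # recurses once per '[' needs 100,000 stack frames for this.
    evil = "[" * 100_000 + "]" * 100_000

    try:
        json.loads(evil)
    except RecursionError:
        # CPython's json module checks its recursion depth and raises
        # instead of crashing; a hand-rolled recursive parser with no
        # such check would overflow the stack and take the process down.
        print("refused: nesting too deep")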
Writing parsers can be tricky, but JSON is certainly not a hard format to parse.
Accepting invalid, ambiguous or undefined JSON is not acceptable behavior. It means bugs get swallowed up and you can't reliably round-trip data.
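One concrete way this bites (a small Python illustration using the standard json module): duplicate keys, which the spec leaves effectively undefined, are silently collapsed, so parse-then-serialize quietly loses data and the round trip doesn't reproduce the input:

    import json

    # Duplicate keys: the spec only says names SHOULD be unique, and most
    # parsers silently keep one of the values rather than complaining.
    doc = '{"id": 1, "id": 2}'

    parsed = json.loads(doc)     # {'id': 2} -- the first value is silently dropped
    print(json.dumps(parsed))    # {"id": 2} -- not the document we were given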
Just to make it explicit (and without inserting any personal judgement into the conversation myself): JSON parsers should reject things like trailing commas after final array elements because accepting them would encourage people to emit trailing commas?
Having asked the question (and now explicitly freeing myself to talk values) it's new to me -- a solid and rare objection to the Robustness Principle. Maybe common enough in these sorts of discussions, though? Anyway, partial as I might be to trailing commas, I do quite like the "JSON's universality shall not be compromised" argument.
Accepting trailing commas in JSON isn't as big a deal as having two different opinions about what a valid document is. But you might think a trailing comma could indicate a hand-edited document that's missing an important field or array element.
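For reference, this is how one strict implementation behaves (CPython's json module): the trailing comma is rejected outright rather than quietly tolerated.

    import json

    try:
        json.loads('[1, 2, 3,]')
    except json.JSONDecodeError as err:
        print("rejected:", err)   # e.g. "Expecting value: line 1 column 10 (char 9)"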
If you put out JSON that gets misparsed, either you generated invalid JSON or the parser is faulty. No way around that.
This has nothing to do with whether parsers have flexibility to accept additional constructs, which is extremely common for a parser to do.
Sorry for the misquote, but does it get to the heart of your objection?
I'm torn here. On the one hand I want to say "Those are not languages one typically writes parsers in," but that's a really muddled argument:
1. People "parse" things often in bash/awk because they have to -- because bash etc deal in unstructured data.
2. Maybe "reasonable" languages should be trivially parseable so we can do it in Bash (etc).
I'm kinda torn. On the one hand bash is unreasonably effective, on the other I want data types to be opaque so people don't even try to parse them... would love to hear arguments though.
If you want to deal with JSON, I'd recommend jq as an easy manipulation tool.
I have, however, written a JSON parser before at a previous company in C++ (they didn't want to pull in a library). It wasn't particularly hard.
And yes, like any other parser, it accepted additional non-JSON constructs. This was simply because it would have taken additional work to error out on those constructs, which would have been a waste of time.
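To illustrate how that leniency falls out for free (a hypothetical Python sketch, not the actual C++ code from that job): the obvious array-parsing loop accepts trailing commas, and even missing commas, unless you add extra checks to forbid them.

    def skip_ws(text, i):
        while i < len(text) and text[i] in " \t\r\n":
            i += 1
        return i

    def parse_value(text, i):
        # Grossly simplified: only numbers and nested arrays, which is
        # enough to make the point.
        i = skip_ws(text, i)
        if text[i] == "[":
            return parse_array(text, i)
        j = i
        while j < len(text) and text[j] not in ",] \t\r\n":
            j += 1
        return float(text[i:j]), j    # float() quietly takes '+1', 'inf', '1_0', ...

    def parse_array(text, i):
        assert text[i] == "["
        i += 1
        items = []
        while True:
            i = skip_ws(text, i)
            if text[i] == "]":              # a trailing comma just falls through
                return items, i + 1         # here, so '[1, 2,]' parses fine
            value, i = parse_value(text, i)
            items.append(value)
            i = skip_ws(text, i)
            if text[i] == ",":
                i += 1                      # and a *missing* comma is tolerated too;
                                            # rejecting either case means more code

    print(parse_array("[1, 2,]", 0))        # ([1.0, 2.0], 7)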
It doesn't matter how sensible a format is, those tools are simply not appropriate to write a parser in.
AWK is a language designed for parsing and processing data. That is exactly what it was designed to do.
How did you solve the parsing of arbitrarily nested structures?
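One common answer (not necessarily what the parent did) is to drop recursion and keep an explicit stack of unfinished containers, so nesting depth costs heap memory instead of call-stack frames. A rough Python sketch handling only arrays of numbers:

    def parse_nested_arrays(text):
        stack = []        # enclosing, not-yet-closed arrays
        current = None
        i, n = 0, len(text)
        while i < n:
            c = text[i]
            if c in " \t\r\n,":
                i += 1
            elif c == "[":
                stack.append(current)
                current = []
                i += 1
            elif c == "]":
                finished, current = current, stack.pop()
                if current is None:
                    return finished          # closed the outermost array
                current.append(finished)
                i += 1
            else:                            # a number; no strings etc. in this sketch
                j = i
                while j < n and text[j] not in ",]":
                    j += 1
                current.append(float(text[i:j]))
                i = j
        raise ValueError("unterminated array")

    print(parse_nested_arrays("[[1,[2]],3]"))              # [[1.0, [2.0]], 3.0]
    parse_nested_arrays("[" * 100_000 + "]" * 100_000)     # fine: no recursion involved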
https://github.com/google/protobuf/tree/master/conformance
Here are some of the quirks of protobuf:
- non-repeated fields can occur multiple times on the wire -- the last value "wins".
- you have to be able to handle unknown fields, including unknown groups that can be nested arbitrarily.
- repeated numbers have two different wire formats (packed and non-packed); you have to be able to handle both.
- when serializing, all signed integers need to be sign-extended to 64 bits, to support interop between different integer types.
- you have to bounds-check delimited fields to make sure they don't violate the bounds of submessages you are already in.
I do think protobuf is a great technology overall. But it has some complexities too; I wouldn't want to oversell its simplicity and have people be unpleasantly surprised when they come across them later. :)

FWIW, I think it'd be awesome if the only wire integer format were a bignum, in order to support full interoperability between integer types. Maybe even do the same for floats, too …
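To make a couple of the quirks above concrete (the varint-keyed fields and the last-value-wins rule), here's a toy Python reader; it is nothing like the real protobuf library, skips groups, packed fields and bounds checks entirely, and the field numbers in the test are made up for illustration:

    def read_varint(buf, i):
        # Base-128 varint: least-significant 7-bit group first, with the
        # high bit of each byte as the continuation flag.
        shift, value = 0, 0
        while True:
            b = buf[i]
            i += 1
            value |= (b & 0x7F) << shift
            if not (b & 0x80):
                return value, i
            shift += 7

    def parse_message(buf):
        fields = {}
        i = 0
        while i < len(buf):
            key, i = read_varint(buf, i)
            field_number, wire_type = key >> 3, key & 7
            if wire_type == 0:                      # varint
                value, i = read_varint(buf, i)
            elif wire_type == 2:                    # length-delimited
                length, i = read_varint(buf, i)
                value, i = buf[i:i + length], i + length
            else:
                raise NotImplementedError("wire type %d" % wire_type)
            fields[field_number] = value            # last occurrence wins,
                                                    # even for non-repeated fields
        return fields

    # Field 1 as a varint, encoded twice (key byte 0x08 = 1 << 3 | 0):
    # the second occurrence overwrites the first.
    print(parse_message(bytes([0x08, 0x2A, 0x08, 0x07])))   # {1: 7}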
Even if they were, it would still be easy to parse!
There's a whole spectrum of unofficial parser "helpfulness" here, with HTML 4 being an extreme case of parsers filled with hacks to deal with existing broken data, protobufs being an extreme case of parsers doing the One and Only True Thing, and JSON mostly toward the same end of the spectrum as protobufs, but a bit less so.
JSON hits a sweet spot of being very easy for computers to deal with almost all the time, while also being reasonably easy for humans to read and write.
I was going to add “if you started with that as the spec, it wouldn’t be hard to design something better than JSON” but real examples like YAML are pretty awkward, so probably it’s a harder problem than it seems.
PHP serialization is better here; everything is type:value or type:length:value. Strings do have quotes around them, but because their byte length is known, internal quotes need not be escaped. You can still have issues with generating and parsing the human-readable numbers properly (floating point is always fun, and integers may have some bit-size limit I don't recall), but you don't need to worry about quoting Unicode values properly.
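A small sketch of why the length prefix helps (Python, reading one PHP-serialized string; the real length is a byte count, but this example sticks to ASCII so characters and bytes coincide):

    # Reads a value like 's:13:"say "hi" fast";' -- the declared length
    # tells us exactly how much to take, so the embedded quotes never
    # need escaping and the closing '";' is just framing.
    def read_php_string(data, i=0):
        assert data[i:i + 2] == "s:"
        j = data.index(":", i + 2)
        length = int(data[i + 2:j])              # declared up front
        start = j + 2                            # skip ':"'
        value = data[start:start + length]
        assert data[start + length:start + length + 2] == '";'
        return value, start + length + 2

    print(read_php_string('s:13:"say "hi" fast";'))   # ('say "hi" fast', 21)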
Protocol buffers have clear length indications, so that's easier, but it's not a 'self-documenting' format: you need to have the description file to parse an encoded value. The end result is usually many fewer bits, though.
Protobuf and similar are binary formats so don't have this limitation.
Canonical S-expressions are both human-readable and length-prefixed. They do this by having an advanced representation which is human-friendly:
(data (looks "like this" |YWluJ3QgaXQgY29vbD8=|))
And a canonical representation which is length-prefixed:

(4:data(5:looks9:like this14:ain't it cool?))
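Emitting the canonical form is close to a one-liner; a small Python sketch over nested lists of byte strings, which reproduces the example above:

    def to_csexp(node):
        # Atoms become <length>:<bytes>; lists just concatenate their
        # children's encodings inside parentheses.
        if isinstance(node, bytes):
            return str(len(node)).encode() + b":" + node
        return b"(" + b"".join(to_csexp(child) for child in node) + b")"

    expr = [b"data", [b"looks", b"like this", b"ain't it cool?"]]
    print(to_csexp(expr).decode())
    # (4:data(5:looks9:like this14:ain't it cool?))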
Certainly not a data format for exchange between systems, especially with untrusted sources.
That said, I've still been bitten by the Python implementation on the Mac acting differently from the C++ implementation on Linux, although I can't remember exactly what the issue was right now.
1) implementation matters
2) "simple" specs never really are
It's definitely important to have documents like this one that explore the edge cases and the differences between implementations, but you can replace "JSON" in the introductory paragraph with any other serialization format, encoding standard, or IPC protocol and it would remain true:
"<format> is not the easy, idealised format as many do believe. [There are not] two libraries that exhibit the very same behaviour. Moreover, [...] edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because <format> libraries rely on specifications that have evolved over time and that left many details loosely specified or not specified at all."