Looking at the matrix, all green, yellow, light blue and dark blue are OK outcomes. Red cells are crashes (a stack overflow on 10000 nested arrays, for example), and dark brown cells are valid JSON that didn't parse (things like UTF-8 mishandling). Neither class of issue is really JSON-specific.
JSON parsers may also permit various additional constructs which are very explicitly not JSON. This means that they may accept a handcrafted, technically invalid JSON document. However, a JSON encoder may never generate a document containing such "extensions", as that would not be JSON.
This concept of having parsers accept more than necessary follows best practices of robustness. As the robustness principle goes: "Be conservative in what you do, be liberal in what you accept from others".
"The Harmful Consequences of the Robustness Principle" https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...
The principle is even more harmful because it sounds so logical. If many JSON parsers accept your object even though it is not valid JSON, any new parser that doesn't accept it will be booed as a faulty parser.
If your input is not according to spec, throw an error. The sender is wrong. They deserve to know it and need to fix their output.
In the large colored matrix, the following colors mean everything is fine: green, yellow, light blue and deep blue.
Red are crashes (things like 10000 nested arrays causing a stack overflow, which is a non-JSON-specific parser bug), and dark brown are constructs that should have been supported but weren't (things like UTF-8 handling, again a non-JSON-specific parser bug).
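To make the stack-overflow case concrete, here's a minimal Python sketch (assuming CPython's standard json module) of the kind of input that does it; CPython happens to guard its recursion and raise an error, while a recursive-descent parser without such a guard simply blows its stack:

    import json

    # A pathological document: 100,000 nested arrays. Any parser that
    # recurses once per '[' needs 100,000 stack frames for this.
    evil = "[" * 100_000 + "]" * 100_000

    try:
        json.loads(evil)
    except RecursionError:
        # CPython's json module checks its recursion depth and raises
        # instead of crashing; a hand-rolled recursive parser with no
        # such check would overflow the stack and take the process down.
        print("refused: nesting too deep")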
Writing parsers can be tricky, but JSON is certainly not a hard format to parse.
Accepting invalid, ambiguous or undefined JSON is not acceptable behavior. It means bugs get swallowed up and you can't reliably round-trip data.
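One concrete way this bites (a small Python illustration using the standard json module): duplicate keys, which the spec leaves effectively undefined, are silently collapsed, so parse-then-serialize quietly loses data and the round trip doesn't reproduce the input:

    import json

    # Duplicate keys: the spec only says names SHOULD be unique, and most
    # parsers silently keep one of the values rather than complaining.
    doc = '{"id": 1, "id": 2}'

    parsed = json.loads(doc)     # {'id': 2} -- the first value is silently dropped
    print(json.dumps(parsed))    # {"id": 2} -- not the document we were given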
Just to make it explicit (and without inserting any personal judgement into the conversation myself): JSON parsers should reject things like trailing commas after final array elements because accepting them would encourage people to emit trailing commas?
Having asked the question (and now explicitly freeing myself to talk values) it's new to me -- a solid and rare objection to the Robustness Principle. Maybe common enough in these sorts of discussions, though? Anyway, partial as I might be to trailing commas, I do quite like the "JSON's universality shall not be compromised" argument.
Accepting trailing commas in JSON isn't as big a deal as having two different opinions about what a valid document is. But you might think a trailing comma could indicate a hand-edited document that's missing an important field or array element.
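For reference, this is how one strict implementation behaves (CPython's json module): the trailing comma is rejected outright rather than quietly tolerated.

    import json

    try:
        json.loads('[1, 2, 3,]')
    except json.JSONDecodeError as err:
        print("rejected:", err)   # e.g. "Expecting value: line 1 column 10 (char 9)"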
If you put out JSON that gets misparsed, either you generated invalid JSON or the parser is faulty. No way around that.
This has nothing to do with whether parsers have flexibility to accept additional constructs, which is extremely common for a parser to do.
Sorry for the misquote, but does it get to the heart of your objection?
I'm torn here. On the one hand I want to say "Those are not languages one typically writes parsers in," but that's a really muddled argument:
1. People "parse" things often in bash/awk because they have to -- because bash etc deal in unstructured data.
2. Maybe "reasonable" languages should be trivially parseable so we can do it in Bash (etc).
I'm kinda torn. On the one hand bash is unreasonably effective, on the other I want data types to be opaque so people don't even try to parse them... would love to hear arguments though.
If you want to deal with JSON, I'd recommend jq as an easy manipulation tool.
I have, however, written a JSON parser before at a previous company in C++ (they didn't want to pull in a library). It wasn't particularly hard.
And yes, like any other parser, it accepted additional non-JSON constructs. This was simply because it would have taken additional work to error out on those constructs, which would have been a waste of time.
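To illustrate how that leniency falls out for free (a hypothetical Python sketch, not the actual C++ code from that job): the obvious array-parsing loop accepts trailing commas, and even missing commas, unless you add extra checks to forbid them.

    def skip_ws(text, i):
        while i < len(text) and text[i] in " \t\r\n":
            i += 1
        return i

    def parse_value(text, i):
        # Grossly simplified: only numbers and nested arrays, which is
        # enough to make the point.
        i = skip_ws(text, i)
        if text[i] == "[":
            return parse_array(text, i)
        j = i
        while j < len(text) and text[j] not in ",] \t\r\n":
            j += 1
        return float(text[i:j]), j    # float() quietly takes '+1', 'inf', '1_0', ...

    def parse_array(text, i):
        assert text[i] == "["
        i += 1
        items = []
        while True:
            i = skip_ws(text, i)
            if text[i] == "]":              # a trailing comma just falls through
                return items, i + 1         # here, so '[1, 2,]' parses fine
            value, i = parse_value(text, i)
            items.append(value)
            i = skip_ws(text, i)
            if text[i] == ",":
                i += 1                      # and a *missing* comma is tolerated too;
                                            # rejecting either case means more code

    print(parse_array("[1, 2,]", 0))        # ([1.0, 2.0], 7)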
It doesn't matter how sensible a format is, those tools are simply not appropriate to write a parser in.
AWK is a language designed for parsing and processing data. That is exactly what it was designed to do.
How did you solve the parsing of arbitrarily nested structures?
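One common answer (not necessarily what the parent did) is to drop recursion and keep an explicit stack of unfinished containers, so nesting depth costs heap memory instead of call-stack frames. A rough Python sketch handling only arrays of numbers:

    def parse_nested_arrays(text):
        stack = []        # enclosing, not-yet-closed arrays
        current = None
        i, n = 0, len(text)
        while i < n:
            c = text[i]
            if c in " \t\r\n,":
                i += 1
            elif c == "[":
                stack.append(current)
                current = []
                i += 1
            elif c == "]":
                finished, current = current, stack.pop()
                if current is None:
                    return finished          # closed the outermost array
                current.append(finished)
                i += 1
            else:                            # a number; no strings etc. in this sketch
                j = i
                while j < n and text[j] not in ",]":
                    j += 1
                current.append(float(text[i:j]))
                i = j
        raise ValueError("unterminated array")

    print(parse_nested_arrays("[[1,[2]],3]"))              # [[1.0, [2.0]], 3.0]
    parse_nested_arrays("[" * 100_000 + "]" * 100_000)     # fine: no recursion involved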
https://github.com/google/protobuf/tree/master/conformance
Here are some of the quirks of protobuf:
- non-repeated fields can occur multiple times on the wire -- the last value "wins".
- you have to be able to handle unknown fields, including unknown groups that can be nested arbitrarily.
- repeated numbers have two different wire formats (packed and non-packed); you have to be able to handle both.
- when serializing, all signed integers need to be sign-extended to 64 bits, to support interop between different integer types.
- you have to bounds-check delimited fields to make sure they don't violate the bounds of submessages you are already in.
I do think protobuf is a great technology overall. But it has some complexities too; I wouldn't want to oversell its simplicity and have people be unpleasantly surprised when they come across them later. :)

FWIW, I think it'd be awesome if the only wire integer format were a bignum, in order to support full interoperability between integer types. Maybe even do the same for floats, too …
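To make a couple of the quirks above concrete (the varint-keyed fields and the last-value-wins rule), here's a toy Python reader; it is nothing like the real protobuf library, skips groups, packed fields and bounds checks entirely, and the field numbers in the test are made up for illustration:

    def read_varint(buf, i):
        # Base-128 varint: least-significant 7-bit group first, with the
        # high bit of each byte as the continuation flag.
        shift, value = 0, 0
        while True:
            b = buf[i]
            i += 1
            value |= (b & 0x7F) << shift
            if not (b & 0x80):
                return value, i
            shift += 7

    def parse_message(buf):
        fields = {}
        i = 0
        while i < len(buf):
            key, i = read_varint(buf, i)
            field_number, wire_type = key >> 3, key & 7
            if wire_type == 0:                      # varint
                value, i = read_varint(buf, i)
            elif wire_type == 2:                    # length-delimited
                length, i = read_varint(buf, i)
                value, i = buf[i:i + length], i + length
            else:
                raise NotImplementedError("wire type %d" % wire_type)
            fields[field_number] = value            # last occurrence wins,
                                                    # even for non-repeated fields
        return fields

    # Field 1 as a varint, encoded twice (key byte 0x08 = 1 << 3 | 0):
    # the second occurrence overwrites the first.
    print(parse_message(bytes([0x08, 0x2A, 0x08, 0x07])))   # {1: 7}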
Even if they were, it would still be easy to parse!
There's a whole spectrum of unofficial parser "helpfulness" here, with HTML 4 being an extreme case of parsers filled with hacks to deal with existing broken data, protobufs being an extreme case of parsers doing the One and Only True Thing, and JSON mostly toward the same end of the spectrum as protobufs, but a bit less so.
JSON hits a sweet spot of being very easy for computers to deal with almost all the time, while also being reasonably easy for humans to read and write.
I was going to add “if you started with that as the spec, it wouldn’t be hard to design something better than JSON” but real examples like YAML are pretty awkward, so probably it’s a harder problem than it seems.
PHP serialization is better here; everything is type:value or type:length:value. Strings do have quotes around them, but because their byte length is known, internal quotes need not be escaped. You can still have issues with generating and parsing the human-readable numbers properly (floating point is always fun, and integers may have some bit-size limit I don't recall), but you don't need to worry about quoting Unicode values properly.
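A small sketch of why the length prefix helps (Python, reading one PHP-serialized string; the real length is a byte count, but this example sticks to ASCII so characters and bytes coincide):

    # Reads a value like 's:13:"say "hi" fast";' -- the declared length
    # tells us exactly how much to take, so the embedded quotes never
    # need escaping and the closing '";' is just framing.
    def read_php_string(data, i=0):
        assert data[i:i + 2] == "s:"
        j = data.index(":", i + 2)
        length = int(data[i + 2:j])              # declared up front
        start = j + 2                            # skip ':"'
        value = data[start:start + length]
        assert data[start + length:start + length + 2] == '";'
        return value, start + length + 2

    print(read_php_string('s:13:"say "hi" fast";'))   # ('say "hi" fast', 21)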
Protocol buffers have clear length indications, so that's easier, but it's not a 'self-documenting' format: you need to have the description file to parse an encoded value. The end result is usually many fewer bits, though.
Protobuf and similar are binary formats so don't have this limitation.
Canonical S-expressions are both human-readable and length-prefixed. They do this by having an advanced representation which is human-friendly:
(data (looks "like this" |YWluJ3QgaXQgY29vbD8=|))
And a canonical representation which is length-prefixed:

(4:data(5:looks9:like this14:ain't it cool?))
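Emitting the canonical form is close to a one-liner; a small Python sketch over nested lists of byte strings, which reproduces the example above:

    def to_csexp(node):
        # Atoms become <length>:<bytes>; lists just concatenate their
        # children's encodings inside parentheses.
        if isinstance(node, bytes):
            return str(len(node)).encode() + b":" + node
        return b"(" + b"".join(to_csexp(child) for child in node) + b")"

    expr = [b"data", [b"looks", b"like this", b"ain't it cool?"]]
    print(to_csexp(expr).decode())
    # (4:data(5:looks9:like this14:ain't it cool?))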
Certainly not a data format for exchange between systems, especially with untrusted sources.
That said, I've still been bitten by the Python implementation on the Mac acting differently from the C++ implementation on Linux, although I can't remember exactly what the issue was right now.
1) implementation matters
2) "simple" specs never really are
It's definitely important to have documents like this one that explore the edge cases and the differences between implementations, but you can replace "JSON" in the introductory paragraph with any other serialization format, encoding standard, or IPC protocol and it would remain true:
"<format> is not the easy, idealised format as many do believe. [There are not] two libraries that exhibit the very same behaviour. Moreover, [...] edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because <format> libraries rely on specifications that have evolved over time and that left many details loosely specified or not specified at all."