skywhopper
While this is true of JSON, it's also true of any other non-trivial serialization and/or encoding format. The main lessons to learn here are that:

1) implementation matters

2) "simple" specs never really are

It's definitely important to have documents like this one that explore the edge cases and the differences between implementations, but you can replace "JSON" in the introductory paragraph with any other serialization format, encoding standard, or IPC protocol and it would remain true:

"<format> is not the easy, idealised format as many do believe. [There are not] two libraries that exhibit the very same behaviour. Moreover, [...] edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because <format> libraries rely on specifications that have evolved over time and that left many details loosely specified or not specified at all."

arghwhat
It really isn't true for JSON either. If you read the rant, most of it is simply about JSON parsers accepting additional, non-JSON syntaxes.

Looking at the matrix, all green, yellow, light blue and dark blue are OK outcomes. Red are crashes (stack overflow with 10000 nested arrays, for example), and dark brown are valid JSON that didn't parse (things like UTF-8 mishandling). The issues aren't really JSON-specific.

dagenix
So if JSON parsers can't agree on what JSON actually is, that isn't a problem?
arghwhat
JSON parsers all agree what JSON actually is (except for the brown colored fields in the compat matrix, which are actual JSON compat bugs).

JSON parsers may also permit various additional variations which are very explicitly not JSON. This means that they may accept a handcrafted, technically invalid JSON document. However, a JSON encoder may never generate a document containing such "extensions", as this would not be JSON.

This concept of having parsers accept more than necessary follows best practices of robustness. As the robustness principle goes: "Be conservative in what you do, be liberal in what you accept from others".

jacobparker
> As the robustness principle goes [...]

"The Harmful Consequences of the Robustness Principle" https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...

benzoate
I consider it more like behaviour outside of the spec is undefined, and some parsers have bugs.
dagenix
A good spec should define all those corner cases. With JSON, it's possible for two bug-free libraries to take the same input and produce different outputs.
arghwhat
Only if the input is invalid JSON.
ospider
Paring "csv" and "obj" format is also difficult. They both have "simple" specs, but neither of them has a standard spec.
jstimpfle
They both have after-the-fact specifications that (try to) codify various flavours that had already been released into the wild. (I assume you mean Wavefront Obj). It's a pity that they aren't better specified. The same holds for ini. Maybe that's just the curse of good engineering: if it's simple, it's easy to make a mess out of it.
SamReidHughes
No, this is not true of many reasonable formats. You don't have to make an obtusely nontrivial format to encode the data JSON does.
arghwhat
JSON is fairly trivial. The post is a nonsensical rant about parsers accepting non-JSON compliant documents (as the JSON spec specifically states that parsers may), such as trailing commas.

In the large colored matrix, the following colors mean everything is fine: Green, yellow, light blue and deep blue.

Red are crashes (things like 10000 nested arrays causing a stack overflow, which is a non-JSON-specific parser bug), and dark brown are constructs that should have been supported but weren't (things like UTF-8 handling, again non-JSON-specific parser bugs).

Writing parsers can be tricky, but JSON is certainly not a hard format to parse.
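To make the trailing-comma point concrete, here is a minimal sketch using CPython's standard json module; the document below is rejected as invalid, while other parsers may accept the same bytes as a deliberate, non-standard extension:

    import json

    try:
        json.loads('[1, 2,]')           # trailing comma: not JSON per RFC 8259
    except json.JSONDecodeError as err:
        print("rejected:", err)         # CPython's json is strict here

Whether a given library is strict or lenient about such inputs is exactly the variation the compat matrix is charting.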

SamReidHughes
As a question of fact, programs put out JSON that gets misparsed by other programs. Some simply parse floating point values differently, or they treat Unicode strings incorrectly, or output them incorrectly. Different parsers have different opinions about what a document represents. This has a real world impact.

Accepting invalid or ambiguous or undefined JSON is not an acceptable behavior. It means bugs get swallowed up and you can't reliably round trip data.

repsilat
> Accepting invalid or ambiguous or undefined JSON is not an acceptable behavior

Just to make it explicit (and without inserting any personal judgement into the conversation myself): JSON parsers should reject things like trailing commas after final array elements because it will encourage people to emit trailing commas?

Having asked the question (and now explicitly freeing myself to talk values) it's new to me -- a solid and rare objection to the Robustness Principle. Maybe common enough in these sorts of discussions, though? Anyway, partial as I might be to trailing commas, I do quite like the "JSON's universality shall not be compromised" argument.

SamReidHughes
Postel's Law or the "Robustness Principle" is an anti-pattern in general.

Accepting trailing commas in JSON isn't as big a deal as having two different opinions about what a valid document is. But you might think a trailing comma could indicate a hand-edited document that's missing an important field or array element.

arghwhat
Unicode and floating point misparsing are not even remotely JSON-related, but are simply bugs that can occur in any parser that handles Unicode or floating point. Thus, complaining about them in a "JSON is a minefield" thread is a bit silly.

If you put out JSON that gets misparsed, you either generated invalid JSON, or the parser is faulty. There's no way around that.

This has nothing to do with whether parsers have flexibility to accept additional constructs, which is extremely common for a parser to do.

Annatar
Actually unless one is doing JavaScript, JSON is extremely difficult to parse correctly. I challenge you to write a simple, understandable JSON parser in Bourne shell or in AWK.
repsilat
> JSON is extremely difficult to parse correctly ... in Bourne shell or in AWK.

Sorry for the misquote, but does it get to the heart of your objection?

I'm torn here. On the one hand I want to say "Those are not languages one typically writes parsers in," but that's a really muddled argument:

1. People "parse" things often in bash/awk because they have to -- because bash etc deal in unstructured data.

2. Maybe "reasonable" languages should be trivially parseable so we can do it in Bash (etc).

I'm kinda torn. On the one hand bash is unreasonably effective, on the other I want data types to be opaque so people don't even try to parse them... would love to hear arguments though.

arghwhat
Why in the world would I write a parser in bash or awk, regardless of the format? I certainly have better things to do. It doesn't matter how sensible a format is, those tools are simply not appropriate to write a parser in.

I have, however, written a JSON parser before at a previous company in C++ (they didn't want to pull in a library). It wasn't particularly hard.

And yes, like any other parser, it accepted additional non-JSON constructs. This was simply because it would take additional work to error out on those constructs, which would have been a waste of time.

IshKebab
Most of the things they test are true of any text-based format, and many of them are true of any serialisation format. E.g. 100000 opening brackets. You could do the same in XML for example and I expect many parsers would fail.
jandrese
Maybe the difference is that nobody ever thought that XML was easy to parse.
SamReidHughes
XML is far worse than JSON, it is known. It's a lot easier to screw up a text based format than a binary format. But it would also be possible to make a better text based format than JSON, mainly by defining down what numerical values are allowed to be represented and exactly how they get parsed, and making it harder to screw up strings. That's where most of the problems are.
jandrese
I think the reason JSON doesn't take well to strong typing is that it is designed for JavaScript. When you pass some JavaScript object to the JSON encoder how is it supposed to decide which numeric representation to use? Does it always go for the smallest one that works? The JavaScript decoder at the other end is just going to throw away that type information anyway, so all it is good for is verifying the data while you are parsing it. Maybe not a totally useless thing to do but it's a lot of work to get that modest benefit.
SamReidHughes
Making an encoding tailored to JavaScript documents that is a binary format is easy. And I don't think strong typing has anything to do with it. Making it handle Python and Ruby objects at the same time is harder, because they have different opinions about what a string is, what a number can be.
> It's a lot easier to screw up a text based format than a binary format.

I don't see a single instance where this is true.

always_good
I notice that in a topic where it'd be so easy and even necessary to rattle off a few names/examples, you've chosen not to do it.
izacus
How do Protocol Buffers (which I see used quite a lot in environments similar to JSON's) compare? Does anyone have experience with the format?
pjscott
I wrote a protobuf decoder once and found it to be remarkably pleasant. Getting the decoder working only took a few hours. The format was obviously designed to be straightforward -- no escaping, no backtracking, no ambiguity. I believe the grammar is LL(0), which is a nice touch. And because it's not meant to be human-readable, there's no incentive for people to make their parsers deviate from the strict grammar; e.g. there's no protobuf quirk analogous to JSON's parser-dependent handling of trailing commas, because why would anyone bother?
haberman
I'm glad you had a good experience with Protocol Buffers. (I work on the protobuf team at Google). But I would advise a bit of caution. Writing a fully-compliant protobuf implementation is trickier than it looks. I'd recommend running any new parser through the conformance tests; I created these to exercise some of the edge cases.

https://github.com/google/protobuf/tree/master/conformance

Here are some of the quirks of protobuf:

    - non-repeated fields can occur multiple times on
      the wire -- the last value "wins".
    - you have to be able to handle unknown fields, including
      unknown groups that can be nested arbitrarily.
    - repeated numbers have two different wire formats (packed
      and non-packed), you have to be able to handle both.
    - when serializing, all signed integers need to be sign-
      extended to 64 bits, to support interop between different
      integer types.
    - you have to bounds-check delimited fields to make sure
      they don't violate the bounds of submessages you are
      already in.
I do think protobuf is a great technology overall. But it has some complexities too; I wouldn't want to oversell its simplicity and have people be unpleasantly surprised when they come across them later. :)
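To illustrate the first quirk in that list, here is a rough Python sketch of a hand-rolled decoder for varint-typed fields only (not a real protobuf implementation, and it ignores every other wire type); it shows the "last value wins" behaviour when a non-repeated field appears twice on the wire:

    def read_varint(buf, pos):
        # Decode a base-128 varint starting at pos; return (value, new_pos).
        result = shift = 0
        while True:
            b = buf[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not (b & 0x80):
                return result, pos
            shift += 7

    def parse_varint_fields(buf):
        # Collect varint fields (wire type 0) into a dict; a later occurrence
        # of the same field number simply overwrites the earlier one.
        fields, pos = {}, 0
        while pos < len(buf):
            key, pos = read_varint(buf, pos)
            field_no, wire_type = key >> 3, key & 0x07
            if wire_type != 0:
                raise ValueError("this sketch only handles varint fields")
            value, pos = read_varint(buf, pos)
            fields[field_no] = value        # "last value wins"
        return fields

    # Field 1 (tag byte 0x08) appears twice with values 1 and 2 -> {1: 2}
    print(parse_varint_fields(b"\x08\x01\x08\x02"))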
> when serializing, all signed integers need to be sign-extended to 64 bits, to support interop between different integer types

FWIW, I think it'd be awesome if the only wire integer format were a bignum, in order to support full interoperability between integer types. Maybe even do the same for floats, too …

iainmerrick
There is no such quirk. Trailing commas are simply not standard JSON (sadly).

Even if they were, it would still be easy to parse!

pjscott
You're right in the same technically-correct sense that parsing HTML 4 is easy: it's just SGML! And as long as you don't have to support any of the crazy deviations from the spec that some people have come to rely on, that's fine.

There's a whole spectrum of unofficial parser "helpfulness" here, with HTML 4 being an extreme case of parsers filled with hacks to deal with existing broken data, protobufs being an extreme case of parsers doing the One and Only True Thing, and JSON mostly toward the same end of the spectrum as protobufs, but a bit less so.

Mikhail_Edoshin
It would be much easier to emit if they were standard. For example, in XML each element is self-contained; I can pour it into the data stream without knowing if it is preceded or followed by a sibling. With JSON I have to manage the context.
toast0
My experience with JSON and similar formats is that most of the complexity arises from using delimited strings instead of length-prefixed strings, and the exciting escaping that results. If the strings are character strings instead of byte strings, you get to add an extra layer of character-encoding excitement.

PHP serialization is better here, everything is type:value or type:length:value, although strings do have quotes around them, because their byte length is known, internal quotes need not be escaped. You can still have issues with generating and parsing the human-readable numbers properly (floating point is always fun, and integers may have some bit size limit I don't recall), but you don't need to worry about quoting Unicode values properly.

Protocol buffers have clear length indications, so that's easier, but it's not a 'self documenting' format, you need to have the description file to parse an encoded value. The end result is usually many fewer bits though.
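A small Python sketch of why length-prefixed strings are easier on the parser: the declared byte length says exactly where the string ends, so there is no escape scanning at all. The s:length:"value"; layout below mimics PHP's serialize format for strings, but it is only an illustration, not a complete reader:

    def read_php_string(buf, pos=0):
        # Read one serialized string like b's:5:"hello";' and return
        # (value, position just past the trailing semicolon).
        assert buf[pos:pos + 2] == b"s:"
        colon = buf.index(b":", pos + 2)
        length = int(buf[pos + 2:colon])
        start = colon + 2                   # skip ':' and the opening '"'
        value = buf[start:start + length]   # no escape handling needed
        assert buf[start + length:start + length + 2] == b'";'
        return value, start + length + 2

    # Embedded quotes need no escaping, because the length says where to stop:
    print(read_php_string(b's:12:"say "hello"!";'))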

ChrisSD
The problem is that formats like JSON are designed to be human readable and writable. Length prefixing is a non starter here.

Protobuf and similar are binary formats so don't have this limitation.

> The problem is that formats like JSON are designed to be human readable and writable. Length prefixing is a non starter here.

Canonical S-expressions are both human-readable & length-prefixed. They do this by having an advanced representation which is human-friendly:

    (data (looks "like this" |YWluJ3QgaXQgY29vbD8=|))
And a canonical representation which is length-prefixed:

    (4:data(5:looks9:like this14:ain't it cool?))
johannes1234321
The PHP serialisation format has many issues, especially since it allows all sorts of PHP data structures to be encoded. This allows defining references and serializing objects using custom routines into arbitrary binary blobs. Also PHP's unserialization can be used to trigger the autoloader as it tries to resolve unloaded classes, which can trigger unsafe routines in those.

It's certainly not a data format for exchange between systems, especially with untrusted sources.

toast0
You shouldn't use PHP's unserialize implementation with untrusted sources; but my point was that its format makes it relatively simple to parse vs json or xml where you have to do a lot of work to parse strings. If you're writing your own parser (including a parser for another language), you could decide to only parse basic types (bool, int, float, array); if you're designing your own format, you could take the lesson of length prefixed strings are much easier to use for computers than delimited strings.
Protocol buffers sidestep the issue of independent implementations behaving differently by simply not having widely used independent implementations.

That said, I've still been bitten by the Python implementation on the Mac acting differently from the C++ implementation on Linux, although I can't remember exactly what the issue was right now.

agrafix
One of the things that has bitten me before was the maximum allowed nesting depth - this was different in two implementations, so one rejected the payload while the other one parsed it fine :(
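For what it's worth, the depth limit is easy to probe with CPython's standard json module; the exact behaviour is implementation-specific, and other parsers may crash outright or parse the same document without complaint:

    import json

    deep = "[" * 100000 + "]" * 100000    # syntactically valid JSON, deeply nested

    try:
        json.loads(deep)
    except RecursionError as err:
        # CPython typically gives up here; a parser with a larger (or no)
        # limit would accept the payload, which is the mismatch described above.
        print("parser gave up:", err)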
ccvannorman
Indeed, this is the principal point behind Gödel, Escher, Bach, Hofstadter's work which explores (by various proofs, theories and parables) how no absolutes exist, particularly with respect to structuring/interpreting data.

"It doesn't make sense to formalize a system absolutely"

JasonFruit
This is interesting and important in one way: anything poorly specified will eventually cause a problem for someone, somewhere. That being said, my first response was to complete the title, ". . . yet it remains useful and nearly trouble-free in practice." There's a lot of, "You know what I mean!" in the JSON definition, but in most cases, we really do know what Crockford means.
Someone
If your API takes json input, some of those issues are potential security or DoS issues.

For example, if you validate your json in your web front-end (EDIT: I used the wrong term. What I meant here is the server-side process that’s in front of your database) and then pass the string received to your json-aware database, you’re likely using two json implementations that may have different ideas about what constitutes valid json.

For example, a caller might pass in a dictionary with duplicate key names, and the two parsers might each drop a different one, or one might see json where the other sees a comment.
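A quick sketch of the duplicate-key hazard with CPython's json module, which keeps the last occurrence; a front-end validating with a "first key wins" parser would effectively see a different document. The object_pairs_hook shown is one way to reject such payloads outright:

    import json

    doc = '{"role": "user", "role": "admin"}'

    print(json.loads(doc))      # {'role': 'admin'} -- last occurrence wins here

    def reject_duplicates(pairs):
        # Fail loudly instead of silently picking one of the values.
        keys = [k for k, _ in pairs]
        if len(keys) != len(set(keys)):
            raise ValueError("duplicate keys: %r" % keys)
        return dict(pairs)

    json.loads(doc, object_pairs_hook=reject_duplicates)   # raises ValueError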

helaan
Reminds me of last year's CouchDB bug (CVE-2017-12635), which was caused by two JSON parsers disagreeing on duplicate keys: here it was possible to add a second key with user roles, allowing users to give themselves admin rights. JSON parser issues are real.
xenadu02
One of the benefits of serialization technology (like Codable+JSONEncoder in Swift or DataContract in C#) is that you get a canonical representation of the bits in memory before you pass the document on to anyone else.

By representing fields with enums or proper types you get some constraints on values as well, eg: If a value is really an integer field then your type can declare it as Int and deserialization will smash it into that shape or throw an error, but you don't end up with indeterminate or nonsense values.

This can be even more important for UUIDs, Dates, and other extremely common types that have no native JSON representation, nor even an agreed-upon convention for encoding them.

You get less help from the language with dynamic languages like Python but you can certainly accomplish the same thing with some minimal extra work. Or perhaps it would be more accurate to say languages like Python offer easy shortcuts that you shouldn't take.

In any case I highly recommend this technique for enforcing basic sanitization of data. The other is to use fuzzing (AFL or libFuzzer).
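A rough Python equivalent of that technique; the Event type and its fields are made up for illustration, the point is only that values get smashed into declared types (or rejected) before anything else sees them:

    import json
    from dataclasses import dataclass

    @dataclass
    class Event:
        user_id: int
        created: str    # e.g. an ISO-8601 timestamp, validated elsewhere

    def decode_event(raw: str) -> Event:
        data = json.loads(raw)
        # Coerce into the declared shape or fail loudly, instead of letting
        # an indeterminate value flow through the rest of the system.
        return Event(user_id=int(data["user_id"]), created=str(data["created"]))

    print(decode_event('{"user_id": "42", "created": "2018-04-22T10:00:00Z"}'))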

SOLAR_FIELDS
This specific RCE vulnerability was actually given as an explicit example of the consequences of the current state of the specifications.
mjevans
Normalize before approval and add filters that only allow in /expressly approved/ items from insecure environments.
paradite
I think it is rather common sense to do data validation on backend instead of frontend. What matters is that backend always acts as the source of truth, it doesn't really matter if frontend and backend are inconsistent as long as we know that backend data is correct.
guntars
I sure hope you don’t just put random user provided blobs in your database, even if they’re validated. Also, how do you validate without parsing? If it’s parsed, might as well serialize again when saving to the DB.
Someone
”If it’s parsed, might as well serialize again when saving to the DB”

You didn’t grow up in the 1980’s, I guess :-)

Why spend cycles serializing again if you already have that string?

toast0
Because experience has shown us that today's parsers don't detect tomorrow's 0-day parsing bugs; but serializing a clean version of what was parsed is more likely to be safe (see lots of jpeg, mpeg, etc exploits)
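In code, that normalization step can be as small as re-serializing whatever your own parser understood, so downstream consumers never see the original, possibly tricky bytes (a sketch with Python's json module, not a complete defence):

    import json

    def normalize(untrusted: str) -> str:
        # Forward our own clean serialization, not the caller's bytes.
        return json.dumps(json.loads(untrusted),
                          ensure_ascii=True, allow_nan=False)

    print(normalize('{"a": 1, "b": "caf\u00e9"}'))   # {"a": 1, "b": "caf\u00e9"}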
If it's parsed, why even store it in a database as JSON at all?

If you don't do that... then multiple possible JSON parsers aren't a problem.

Mikhail_Edoshin
Use case: I sync local data with web API. I do not use all the data I receive, only a few bits, but if I modify them, I have to send a complete object back to the server with all the other data. The simplest way to do this is to store the original JSON.

CardDAV and CalDAV are not JSON, but their specifications also require you to preserve the whole vCard if you ever want to send your changes back to the server. CardDAV data may be accessed by multiple apps and they are allowed to add their private properties; any app that deals with vCards must preserve all properties, including those it doesn't understand or use.

couchand
One common reason is to provide a flexible table for things that may not have an identical schema. For instance, an event log might have details about each event that differ based on the event type.
leothelocust
Fully agree. You can cause parsing issues, but you can also... not.

If you are creating a JSON response from your own API, you control the JSON output.

Unless you are crafting JSON from scratch I doubt anyone runs into the issues mentioned in the OP.

bufferoverflow
> anything poorly specified

I thought JSON was specified quite clearly.

http://json.org/

There are no limits on the loops in the grammar (e.g. the number of consecutive digits in numbers), but I don't consider that a weakness of the standard.

Most of the tests that I see do pass completely invalid JSON.

http://seriot.ch/json/pruned_results.png

spc476
So, 9223372036854775807 is a valid number per the json.org spec, but good luck getting a typical JSON decoder to process that number. A couple I tried returned it as 9.2233720368548e+18, which is not the same number.
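A quick Python demonstration of the same effect: the literal is valid JSON and fits in a signed 64-bit integer, but any decoder that stores numbers as IEEE-754 doubles will round it:

    import json

    doc = "9223372036854775807"     # valid JSON per the grammar

    print(json.loads(doc))          # Python keeps the exact int: 9223372036854775807
    print(float(doc))               # a double-based parser sees ~9.223372036854776e+18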
Isn't that a limitation of the language and API rather than the parser/decoder? I would guess that most users don't want a JSON decoder that depends on some library for arbitrary-precision numbers and returns such numbers for the user's inconvenience.

The summary table suggests that a real bug was found in about half of the parsers tested, and even a few of those bugs belong to a category that one might choose to ignore: almost any non-trivial function can be made to run out of memory, and lots of functions will crash ungracefully when that happens, rather than correctly free all the memory that was allocated and return an error code, which the caller probably doesn't handle correctly in any case because this case was never tested.

freshhawk
It would be great if programmers learned from markdown and json.

Here is the lesson:

1. We need something simpler, so I will make a simple solution to this problem.

2. Simple should also mean no strict spec, support for versioning or any of those engineer things. All that engineer shit is boring and I can tell myself this laziness is "staying simple".

3. OH SHIT, I was totally right about #1 so this got popular, and having been designed as just a toy is causing a lot of problems for a massive number of people ... now my incompetence regarding #2 is on display and there is nothing I can do about it.

I'm not saying "Thou shalt always add bureaucracy to your toy projects", but look at what happened and think about how Gruber and Crawford will be remembered, partly if not mostly, for being "the asshole who screwed up X". If you go the other way programmers will think "damn I hate these RFCs, these suits are messing up the beautiful vision of Saint [your name here]".

mchanson
Oh yes RFC process always keeps things from having compatibility issues.

I definitely never saw any issues with all those XML based standards like SOAP or XSLT.

eadmund
A lot of that is because XML is objectively insane: it's a monumentally over-specified version of something that a sane community would have sketched out on the back of a cocktail napkin. XML is S-expressions done wrong. It's a massive amount of ceremony & boilerplate, IMHO due to the pain of dealing with dynamic data in static languages. It's basically the Java of data-transfer languages.

And it shouldn't even be used for data transfer: it's a markup language, for Pete's sake!

Mikhail_Edoshin
Please. XML specification is much shorter than that of YAML, for example, even though XML 1.0 includes a simple grammar-based validation spec (DTD). "A markup language"? What does it mean? Are there any special "data-transfer languages" we neglect? :) Data gets serialized; we need to mark different parts of it; XML can totally do it. For some cases it's not the best fit, but nothing is.
> And it shouldn't even be used for data transfer: it's a markup language

XML could never be used for data transfer. That is being done by the protocol. That would be, in most of the XML cases: HTTP(S).

GET http://en.wikipedia.org/wiki/Rolling_Stones/wp:discography/w...

GET http://store.steamcommunity.com/profile/myprofile/games.xml/...

Wow! That's a beauty! And that's only because XML is BOTH a document and a data structure. It has two personalities, but only one identity. And it is not schizophrenic about it. It's always clear.

> A lot of that is because XML is objectively insane

I don't find "everything is a node" to be insane. It's like "Everything is a file" followed through up to the atomic value.

/net/host/volume/directory/file.xml/document-node/some/other/node/attribute

or

/net/host/volume/directory/file.xml//all-nodes[@where-this-attributes-value="foobar"]

Looks like a perfect match for both command line as well as RESTful access.

> monumentally over-specified

The XML spec, while having a healthy size, is not overly big: https://www.w3.org/TR/xml/

Do not confuse the additional specs like XSL, XPath and XQuery with the "XML" spec. These are your toolbox. And their volume is in no way bigger than that of the frameworks programmers use. Also XSD is not really part of the XML spec. You don't need it in many cases.

It's a meta language that consists of a simple convention: elements and attributes. You name them what you want and get a document that, at the same time, is a queryable data structure. But I've said that already...

dagenix
"objectively insane", "over-specified". I'm not sure what "objectively sane" is, but, I guess something specified that way would be a good thing - provided its not "over-specified" since that would be a bad thing, I guess.
o_____________o
Wouldn't it be great if we had some ML to transform unnecessary sarcasm?

> RFC processes don't always keep things from having compatibility issues. XML based standards like SOAP or XSLT had many issues.

It would make the internet seem much less passive aggressive. Sarcasm is often just a thin, bitter encoding over simple statements.

unmole
IETF's RFC process does a pretty good job at keeping things compatible. But I agree W3C specs, especially the SOAP ones are absolutely horrible.
status_quo69
I'm not going to disagree that data formats should undergo a process... but sometimes they undergo too much process. Most of the EDI standards, for example, are the product of a huge amount of committee work, sell a lot of standards documents, and are used extensively, yet are almost impossible to write a parser for.

Edit: obviously this isn't the case, since people have actually parsed it, yet it's a non-trivial thing to do and I have yet to see a solution that isn't horribly specific or a solution that ignores something that might be problematic in non-standard solutions (of which there are a million because of the difficulty involved). A huge amount of domain specificity doesn't help either.

> look at what happened and think about how Gruber and Crawford will be remembered, partly if not mostly, for being "the asshole who screwed up X".

Completely disagree. Gruber and Crockford are remembered as the people who came up with these really good formats, much better than the overdesigned alternatives.

freshhawk
Ugh, it is too late to do anything about that formatting. Sorry.
borplk
I think the idea of humans sharing a language with computers is problematic at a fundamental level.

(the whole $dataformat "easy to read for humans")

It becomes a source of never ending lose-lose compromises where the more points you give to human convenience the more points you take away from machine convenience and vice versa.

Then you end up having to "settle" for something in between that is just ambiguous enough to be problematic for machines and just noisy enough that humans can barely cope with looking at it and editing it. That's basically what JSON is.

If we accept to use a transformation step and better tooling we can free the representation from this tension of "friendly for computers vs friendly for humans".

It's also a bit odd that we apply this readability obsession only to these data formats.

I don't hear people wanting a human readable text representation of their audio, video or images.

rainbowmverse
>> I don't hear people wanting a human readable text representation of their audio, video or images.

This is, in fact, a huge concern for people who think about accessibility.

borplk
Ok but what I'm talking about is a little more specific.

Talking about "human readability" of JSON and XML is a little bit like talking about human readability of JPEG or MP3.

Chasing after it creates a lot of problems.

Formats like JSON and XML often carry lots of textual information so it's tempting to want them to be like "just like text but with some extra stuff" but that creates its own problems.

So it would be interesting to have something like JSON but philosophically treat it like MP3. Meaning, don't assume that humans must fiddle with the bytes in a text editor with great ease so that the representation can be designed without the influence of "would the raw bytes look pretty to people?".

zenhack
What you're proposing sounds like cbor and/or messagepack (which are virtually identical in their design), or argdata[1].

I agree it's a pretty solid spot in the design space.

[1]: https://github.com/NuxiNL/argdata
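For example (this assumes the third-party msgpack package, installed with pip install msgpack; cbor2 is analogous), the same data model round-trips through a compact binary encoding with no quoting or escaping rules to argue about:

    import msgpack

    packed = msgpack.packb({"looks": "like this", "n": 12345})
    print(packed)                    # compact bytes, not meant for text editors
    print(msgpack.unpackb(packed))   # {'looks': 'like this', 'n': 12345}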

bastawhiz
Indeed! Facebook spends a ridiculous amount of resources creating text summaries of what's in user-provided images. Check out the alt="" tags next time you're scrolling through your feed. Every single image has one.
saagarjha
Of course, their methodology for doing so is nonstandard, since there are obvious incentives for Facebook to use it and keep it to themselves. Unfortunately this keeps these transformations out of reach for most applications.
delecti
I just checked what you're talking about. Creepiness aside, that's kinda awesome.
kalleboo
iOS also does the same thing - if you turn on VoiceOver and go into your photo library it will read out what's in the photos
azernik
Generally the way we address it is not to make the formats themselves human-readable; that's a fool's errand. Instead we create tools to transform them into human-readable formats on demand. See also the protobuf toolchain.
I would argue that YAML (which actually is a superset of JSON) comes closer than JSON to being friendly for humans.
ajross
Nothing in your argument seems relevant to whether your mythical format is "readable" or not. The industry has had extensive experience dealing with interchange of commonly-used binary formats (.xls comes to mind as one that still sees some use) and if anything that experience was even worse.

You're just saying that if we stopped trying to adhere to "our" aesthetic and replaced it with "your" aesthetic when we created an all-new format, that it would be better. I think there's an XKCD to cite about that...

borplk
You have misunderstood the point I was trying to make.

I'm saying the idea that we as humans should view/consume the exact same thing raw and byte-by-byte without any transformations in between as computers do creates a problematic trade-off because that representation now has two very different masters, humans and machines.

If you want to make that representation pretty and friendly for humans it becomes hard to parse and consume for machines.

If you try to make it friendly for machines it becomes inaccessible, noisy and annoying for humans.

So you have to settle somewhere in between that is tolerable for both masters but not perfect for either one.

With some exaggeration, it's as if we required the binary representation of an image to be "raw viewable/consumable" by humans in a text editor.

ajross
No, I'm pretty sure I took that point. It's just that... that isn't the problem. All formats get messy for the same reason that all software designs get messy. It's just that messy software is amenable to replacement, while messy formats leave their garbage in public to scream about on HN.

And like I said, we've been where you want to be: .xls, in particular, is actually a very simple format at its core and easily inspectable with a hex dump or whatnot. Likewise Wordperfect's binary format back in the day was straightforward for humans to use, yet still binary. It didn't help. They still sucked.

To wit: you can't solve this problem your way. All you'll do is create another messy format, c.f. XKCD 927.

peoplewindow
You're using a file format designed three decades ago to make arguments about what we should be doing today.

XLS as a format sucks because:

a) It was undocumented for most of its life

b) Undocumented even inside Microsoft because it consisted partly of memory dumps from the app

c) Was heavily optimised for fast loading and saving on very slow machines

100% of uses of JSON, translated to binary, would not suffer those issues. They'd have documented schemas, at least internally, they wouldn't be created by memcpy to disk, and they wouldn't be stuffed with app-specific loading optimisations.

__david__
You mean like XBM, XPM, SVG, and EPS?
amyjess
I remember being horrified to discover that XPM is actually a form of C. That just screams injection risk.
eponeponepon
This is the one thing that the JSON-against-XML holy warriors need to understand properly. Yes, JSON's less verbose; yes, it's just "plain text" (in as much as there is such a thing); yes, XML makes you put closing tags in - but if you need reliable parsing and rock-solid specifications (and it's reasonably likely that you do, even if you think you don't...), then XML, for all its faults, is very likely the better way.
jstimpfle
Xml sort-of died (in many domains) because of its insane complexity, its redundant ways of specifying relations (child relationships vs explicit relationships, tag name data, attributes data, body data), lack of legibility, parser performance (probably an inherent problem due to hierarchical representation?) and other issues, like even the meaning of whitespace.

More than 50% of the JSON or Xml I've seen would actually be much easier to read and have much clearer semantics if just written as a relational database.

Some time ago, I tried to improve on CSV by specifying it better and making it more powerful. The result was not too bad: http://jstimpfle.de/projects/wsl/main.html . But I think it should be trimmed down even more, and have only built-in datatypes like JSON, to be able to replace it. (More ambitious standardization efforts would lead to similar problems as with Xml, I think.) That's why so far I use the approach only in ad-hoc implementations, in different flavours needed for various tasks.
ProblemFactory
XML parsers are necessarily over-complicated for structured data, because it is a text markup language, not a nested data structure language.

<address>123 Hello World Road, <postcode>12345</postcode>, CA</address>

is perfectly sensible XML. The address is not a tree structure or a key-value dictionary - it is free text with optional markup for some words.

You can use XML to represent nested data structures with lists and dictionaries, but the parsers and their public APIs must still handle the freeform text case as well.

jstimpfle
Yep, the application to text documents is valid in my eyes, as well. Although there are lighter-weight and/or more extensible approaches, like TeX. (update, clarification: I mean just the markup syntax, not the computational model)
anonymouz
> TeX

Dear god no. I use and love (La)TeX daily to write documents. But as a markup format for data that's supposed to be processed in any way, other than being fed to a TeX engine, it's absolutely terrible. You can't even really parse a TeX document; with all the macros it really is more a program than a document. XML is far from perfect, but it works well as a markup and data exchange format that is well-specified.

_delirium
I like TeX for producing documents. But I'd take XML over TeX if I had to parse the markup myself, outside of the TeX toolchain. Any nontrivial TeX document is built out of a pile of macros, so you need to implement a TeX-compatible macro expander to parse it. And at least with XML there are solid libraries, while the state of TeX-parsing libraries outside of TeX itself is pretty poor. I think Haskell is the only language with a reasonably good implementation, thanks to the efforts of the pandoc folks.
Someone
Lighter-weight? Tex is Turing-complete. You can’t even know whether interpreting it will ever finish, and writing a parser that produces good error messages on invalid input is difficult.
edejong
From someone who has written TeX macros before: you probably mistake the 'clean' environment of LaTeX with the core TeX language. The former is reasonable, if very limited; the latter is die-hard "you thought you knew how to program, but this proves you wrong" material.

XML over TeX any time and LISP-like over XML (with structural macros)

dagenix
"Insane complexity"? You mean that it actually has a spec instead of the back of a business card with no implementations that agree on what is actually valid?
jstimpfle
Yes. No. What it does is too complicated. And I hear that consistent implementations are not a reality for Xml, either (at least regarding implemented features).
codahale
XML parsing is notably an even larger minefield: https://www.owasp.org/index.php/XML_Security_Cheat_Sheet
MaulingMonkey
I've written JSON parsers to replace platform specific JSON parsers with bug-for-bug (or at the very least misfeature-for-misfeature) parity to port code without breaking it, without too much going terribly wrong. I wouldn't even try to attempt the same for XML.

Generating a useful conservative subset of JSON that most/all JSON serializers will accept hasn't been that hard in practice IME (no trailing commas, escape all unicode, don't assume >double precision/range scalars, etc.), but I still haven't figured out how to do the same for some XML serializers (failing to serialize because it lacks 'extra' annotation tags in some cases, failing to serialize because it doesn't ignore 'extra' annotation tags in other cases...)
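As a sketch of such a conservative output profile with Python's standard json module (the exact set of options is a judgement call, not any kind of standard): ensure_ascii escapes everything outside ASCII, allow_nan refuses the NaN/Infinity extensions that are not actually JSON, and sort_keys plus tight separators make the output deterministic and compact:

    import json

    payload = {"name": "Zoë", "count": 3}

    conservative = json.dumps(payload, ensure_ascii=True, allow_nan=False,
                              sort_keys=True, separators=(",", ":"))
    print(conservative)     # {"count":3,"name":"Zo\u00eb"}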

drawkbox
Vogons destroyed XML and they would love to destroy JSON. Back away from the JSON vogons, go make another 'simple' format that you put in all your edge cases for complexity. Just try to make a format more simple than JSON, it is based off the basic object, list and basic types string, number, date, bool etc. Where data doesn't fit in those you make it fit or move to another format like YAML, BSON, XML, binary standard like Protobuf or custom binary where needed for say real-time messaging when you control both endpoints always otherwise you have to constantly update a client consumer as well.

JSON is a data and messaging format meant to simplify. If you can't serialize/deserialize to/from JSON then your format might be too complex, and if it doesn't exactly fit in JSON just put the value in a key and add a 'type' or 'meta' key that allows you to translate to and from it. If binary store it in base64, if it is a massive number put it in a string and a type next to it to convert to and from. JSON is merely the messenger, don't shoot it. JSON is so simple it can roll from front-end to back-end where parsing XML/binary in some areas is more of a pain especially for third party consumers.

JSON being simple actually simplifies systems built with it which is a good thing for engineers that like to take complexity and make it simple rather than simplicity to complexity like a vogon.

eropple
If other people suggesting that, hey, maybe we should actually be able to express a number correctly makes you splutter about "vogons" or whatever, perhaps it is not they who should take a step back. (For this isn't just "massive" numbers, but anything that isn't a float--themselves ranking just after `null` as the worst disaster in current use in general-purpose programming.)

Telling people to "just" take actions that decrease the reliability and the rigor of their data because of...vogons?...is one of those weird middlebrow things that HN tends to try to steer clear of, last I checked.

(edit: To be clear, I get the reference, I think it's a silly one both for the childish regard the poster to whom I am replying has for other people and textually because it doesn't even hang.)

drawkbox
> If other people suggesting that, hey, maybe we should actually be able to express a number correctly makes you splutter about "vogons" or whatever, perhaps it is not they who should take a step back.

I guess what I am saying is JSON was created for simplicity and needs no updates.

XML has already been created and other formats like BSON, YAML etc or create a new one that suits more detailed needs.

The sole reason that JSON is so successful is it has fought against 'vogon' complication and bureaucracy that riddled XML and many binary formats of the past. JSON is for dynamic, simple needs and there are plenty of other, more verbose formats for those needs. JSON works from the front-end to the back-end and there are some domain-specific ways to store data that is more complex without changing the standard, or if that doesn't work, move to another format. The goal of many seems to be to make JSON more complex rather than to understand that it was created solely for simplicity. If it is already hard to parse it will be worse when you add in many versions of it and more complexity.

I also find it interesting that we seem to be circling back to binary and complex formats. HTTP/2 might be some of the reason this is happening and big tech turns away from open standards.

Binary formats lead to bigger minefields if they need to change often. Even when it comes to file formats like Microsoft Excel xls for example, those are convoluted and they were made more complex than needed, leading Microsoft themselves to create xlsx which is XML based and even still it is more complicated than needed. Microsoft has spent lots of money on version converters and issues with it due to their own binary choices and lock-in [1].

> As Joel states, a normal programmer would conclude that Office’s binary file formats:

- are deliberately obfuscated

- are the product of a demented Borg mind or vogon mind

- were created by insanely bad programmers

- and are impossible to read or create correctly.

A binary data/storage format that has to change often will eventually become convoluted, because it is easier to just tack something on at the end of the blob than to think about structure and version updates. Eventually it is a big ball of obfuscated data. JSON and XML are at least keyed, JSON being more flexible than XML and binary to changes and versioning.

Lots of the move to binary is reminiscent of reasons before that led to lock-in, ownership and because some engineer needed to put in more complexity for those ends.

There are good and bad reasons to use all formats. If JSON doesn't suit your need for numeric precision or length and you can't store the value (a bigint, for instance) as a string with a type key describing it, maybe JSON isn't the format for the task.

Though SOAP was probably created by vogons straight up, primarily as lock-in, as WSDL and schemas/DTDs never really looked to be interoperable but looked to own the standard by implementing complexities with embrace, extend, extinguish in mind. That overcomplexity is the reason web services were won by JSON/REST/HTTP/RPC.

JSON is Javascript Object Notation and it was created for that reason, because it is so simple the usage spread to apis, frontends, backends and more. People trying to add complexities breaks it for the initial goal of the format.

JSON won due to simplicity and many want to take away that killer feature. Keeping things simple is what the best programmers/engineers do and it is many times harder than just adding in more complexity.

[1] https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...

hyperpape
Forget XML. You can argue that JSON is better because it’s simpler, and while I’m conflicted, I know I enjoy working with JSON more.

The real question is: with the benefit of hindsight, could you define a better but similarly simple format?

Would an alternative to JSON that specified the supported numeric ranges be less simple? Not really. Would it be better? Yes. The current fact that you can try to represent integers bigger than 2^53, but they lose data, makes no sense except in light of the fact that JSON was defined to work with the quirks of JavaScript.

It's true that different tools are adapted for different uses. But sometimes one tool could have been better without giving up any of what made it useful for its niche.
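The 2^53 boundary is easy to demonstrate without any JSON at all, since it is purely a property of IEEE-754 doubles, which is what a JavaScript-style parser stores every number in:

    # 2**53 and 2**53 + 1 collapse to the same double, so a parser using
    # doubles cannot round-trip the second value.
    print(float(2**53) == float(2**53 + 1))   # True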

drawkbox
> The real question is: with the benefit of hindsight, could you define a better but similarly simple format?

I think the only answer to that question is to build it separate from JSON if you think it can be better, if it is truly better it will win in the market. There is no reason to break JSON and add complexity to the parsing/handling. It is 10x harder to implement simplicity than a format that meets all your needs that ultimately adds complexity.

The problem is when people want to add complexities to JSON. There is nothing stopping anyone from adding a new standard that does do that. But I will argue til the end of time that JSON is successful due to simplicity not edge cases.

Everything you mention can be implemented in JSON just as a string with type info, just because you want the actual type in the format might be the problem, it doesn't fit the use case of simplicity over edge cases. Your use case is one of hundreds of thousands people want in JSON.

> But sometimes one tool could have been better without giving up any of what made it useful for its niche.

Famous last words of a standards implementer. JSON wasn't meant to be this broad, it reached broad acceptance largely because for most cases it is sufficient and simplifies data/messaging/exchange of data. There are plenty of other standards to add complexity or build your own. You use JSON and like it because it is simple.

The hardest thing as an engineer/developer is simplifying complex things, JSON is a superstar in that aspect and I'd like to thank Crockford for holding back on demands like yours. Not because your reasons don't hold value, they do, but because it is moving beyond simplicity and soon JSON would be the past because it will have been XML'd.

In my opinion JSON is one of the best simplifications ever invented in programming and led to tons of innovation as well as simplification of the systems that use it.

If people make JSON more complex we need a SON, Simple Object Notation that is locked Crockford JSON and any dev that wants to add complexity to it will forever be put in the bike shedding shed and live a life of yak shaving.

Correct XML parsing carries at least one DoS attack and one server-side reflection attack, and those are just the two obvious ones. Hence, any secure XML endpoint must not be fully conformant. That's a pretty nasty situation.

And I'm still traumatized from a university project in which we tried to compile XSD into serializer/deserializer pairs in C and Java. The compiler structure was easy, code generation was easy, end-to-end cross-language tests with on-the-fly compilation were a little tricky because we had to hook up gcc / javac in a JUnit runner. But XSD simple types are hell, and XSD complex types are worse.

3pt14159
Disagree.

I can always make my JSON act like XML if I want to. When I'm following something like JSON API v1.1 I get a lot of the advantages that I'd get from XML with 99% less bloat. You want types? Go for it! There are even official typed JSON options out there. The security / parsing issues with XML alone are enough for me to rule it out.

How many critical security issues are the result of libxml? Nokogiri / libxml accounts for 50% of my emergency patches to my servers. ONE RUBY GEM is the result of half of my security headaches. That's insane. I only put up with it because other people choose to use XML and I want their data.

How many issues are the result of browsers having to deal with broken HTML (a form of XML)?

JSON isn't perfect, and I wouldn't use it absolutely everywhere, but it's dead simple to parse[0], readable without the whitespace issues of YAML, and I can't think of one place I'd use XML over it.

[0] http://json.org/

> the whitespace issues of YAML

It's not just whitespace issues YAML has - try storing ISO two-letter country codes as values and as soon as you get to Norway, you've got a boolean.

There are many things which are deceptively "simple" where that actually means "I haven't thought about this very much".

Could you not just use 'no'?
Absolutely, but only once you've tracked down the unexpected bug.
koolba
> There are even official typed JSON options out there.

What are the "official" ones?

Everything I've seen involves validation and explicit formatting for a couple specific types (ex: ISO-8601 dates) but it requires the target to specify what it expects.

There's no way to tell staring at a JSON string if "2018-04-22" is meant to be a date rather than a text string.

bmurphy1976
> There's no way to tell staring at a JSON string if "2018-04-22" is meant to be a date rather than a text string.

I believe the op meant you should do something like this:

"created": { "type": "datetime", "format": "iso8601", "value": "2018-04-etc" }

Now there's no ambiguity and the serialization is still json compliant. You have to let go of the notion that you can just put a date formatted string in there and things will magically work.
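A hedged sketch of decoding such an envelope in Python; the type/format/value field names follow the example above and are a convention between sender and receiver, not anything JSON itself defines:

    import json
    from datetime import datetime

    def decode_typed(value):
        # Unwrap {"type": ..., "format": ..., "value": ...} envelopes;
        # anything else is passed through untouched.
        if isinstance(value, dict) and "type" in value and "value" in value:
            if value["type"] == "datetime" and value.get("format") == "iso8601":
                return datetime.fromisoformat(value["value"].replace("Z", "+00:00"))
        return value

    doc = '{"created": {"type": "datetime", "format": "iso8601", "value": "2018-04-22T00:00:00Z"}}'
    record = {k: decode_typed(v) for k, v in json.loads(doc).items()}
    print(record["created"])    # 2018-04-22 00:00:00+00:00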

looperhacks
HTML isn't XML. It's close, but it isn't. There's XHTML for that.
eponeponepon
Just for the record - XML and HTML are both subsets of SGML, somewhat overlapping, but by no means coterminous with each other (at least until HTML 5 - I'm honestly not sure what its relationship to SGML is).

And, speaking from experience, the XML nay-sayers should largely be glad if they never had to deal with SGML :)

teddyh
HTML pretended to be a subset of SGML, but never really was, and the illusion quickly dispersed as time went on, since HTML was strictly pragmatic and ran in resource-constrained environments (the desktop), while SGML was academic, largely theoretical, and ran on servers, analyzing text.

XML, on the other hand, was more of a back-formation – a generalization of HTML; it was not, as I understand it, directly related to SGML in any way. The existence of XML was a reaction to SGML being impractical, so it would be strange if XML directly derived from SGML.

The main point of HTML5 is that it is not defined in terms of SGML but by its own grammar, which is in fact described by an imperative algorithm for parsing it (which also unambiguously specifies what should happen for notionally invalid inputs, AFAIK to the extent that for every byte stream there is exactly one resulting DOM tree).
spiralx
http://sgmljs.net/docs/html5.html

HTML5 is almost a subset of SGML, barring some ambiguities in its table spec, HTML comments in script tags and the spellcheck and contenteditable attributes.

3pt14159
I’m aware that HTML 5 isn’t XML, but I thought XHTML was, and browsers still have to support it because not everyone is on HTML 5.

Either way, my original draft included language around the distinction, but I felt I’d already written too much so I cut it.

> I can always make my JSON act like XML if I want to.

You can make your JSON have two identities at the same time: Document and data structure?

chrisoverzero
> […] it's dead simple to parse […]

I've heard it's a minefield.[0]

[0]: http://seriot.ch/parsing_json.php

rmrfrmrf
If you think XML doesn't suffer from all the same issues, you haven't used it enough. I'd use protobuf for something that needs strict serialization and parsing.
jstimpfle
I think protobuf is a binary format though?
petters
It can be rendered to and parsed from text but that is typically not used.
mixedCase
If I needed such strictness in parsing I'd fall back to s-expressions, not something that requires a parser like this:

  $ ls -lah /usr/lib/libxml2.so.2.9.8

  -rwxr-xr-x 1 root root 1,4M mar 27 17:46 /usr/lib/libxml2.so.2.9.8
pjscott
It's worth pointing out that libxml2 also contains an HTML parser, implementations of XPath, XPointer, various other half-forgotten things beginning with the letter X, a Relax-NG implementation, and clients for both HTTP and FTP. The actual XML parser doesn't need any of that, and almost certainly takes up a lot less than 1.4 MB.
icebraining
But that's the point, isn't it? S-expressions are light because they define very little, it's only a tree of undefined blobs of data (atoms). It's even more limited than JSON.
JSON can be mapped perfectly to s-expressions. So can xml.

Isn't JSON more or less (apart from the commas and colons) just sexprs with a simple schema and different styles of brackets?

icebraining
I'm not saying it can't be mapped; I'm saying it loses semantics in the translation. For example, how do you represent a boolean in a s-exp, such as that anyone with "the s-exp spec" can unambiguously know that's a boolean?
It's easy to be strict if you are both a producer and a consumer.
lifthrasiir
No. This is not how you should approach this problem.

The main problem with XML in this regard is a lack of proper data model. In JSON you have a single, mostly consistent data model that is `value ::= atom | [value...] | {key:value...}`. As a data format JSON is half-baked and inefficient, but as a data model it is very clear. On the other hands, XML shines when you actually need the semi-structured markup (which would be very mouthful to represent in JSON).

that point is moot with xml, since xml (and don't even get me started on xslt) is as much of a mess as json.

Go try to nest a few XML documents and then come back saying if you still love it, or even if you still consider it a well-defined standard.

mercutio2
I’m very confused. What’s hard about nesting XML documents? That’s one of XML’s strengths, and JSON’s weaknesses, because JSON doesn’t have namespaces.
inglor
> For instance, RFC 8259 mentions that a design goal of JSON was to be "a subset of JavaScript", but it's actually not.

Actually, it's really close https://github.com/tc39/proposal-json-superset

This is a stage 3 proposal likely to make it to the next version of the spec. At which point JSON would truly be a subset of JavaScript.

tenken
Better late than never ...
kcolford
This brings to mind the old internet motto (someone correct me on the actual source): "be liberal in what you accept, and be conservative in what you send".

JSON is pretty clear on what certain things should mean, strings are Unicode plus escape sequences, objects map keys to values, arrays are ordered collections of values, the whole serialized payload should be Unicode, etc. Even those things can be relaxed further. This IMHO is what makes JSON so robust on the internet and the perfect choice for a non-binary communication protocol.

0xcde4c3db
> "be liberal in what you accept, and be conservative in what you send"

This is commonly known as Postel's Law, and comes from one of the TCP RFCs [1].

[1] https://en.wikipedia.org/wiki/Robustness_principle

This is also widely considered a bad idea now. Making liberal consumers allows for sloppy producers. Over time this requires new consumers to conform to these sloppy producers to maintain compatibility.

Just look at the clusterfuck that HTML5 has become. You need to have extremely deep pockets to enter that market.

_greim_
> Just look at the clusterfuck that HTML5 has become.

Ouch. I feel like this is kind of unfair. XML, HTML1-4, and HTML5 all differ in how they treat Postel's law. XML rejects it at the spec level; if you send garbage to a parser it bails immediately, which is nice. HTML5 embraces Postel's law at the spec level. If you send garbage to an HTML5 parser, there's an agreed-on way to deal with it gracefully. Also nice. The problem was rather with HTML1-4, which embraced Postel's law promiscuously, at the implementation level. There were specs, but mainstream implementations largely ignored them and all handled garbage input slightly differently. This is what created the afore-mentioned clusterfuck.

Yea this is absolutely what I meant. HTML5's complexity is a symptom of this problem.

I'm a bit worried about the authors taking this overboard and trying to redefine the URL standard with similar complexity.

erik_seaberg
HTML5 only provides the "be liberal in what you accept" error handling; they have never seen fit to write a "be conservative in what you send" grammar for authors and validators.
taeric
Do you have a survey or other citation for it being a bad idea? I get that it enables bad behavior, per se. However, the idea of rejecting a customer/client because they did not form their request perfectly seems rather anti-customer.

Ideally, you'd both accept and correct. But that is the idea, just reworded.

The problem with this idea is that different consumers might have a different subset of what they accept and correct.

If some of those become dominant, producers might start depending on that behavior and it becomes a de facto standard. This is literally what has happened to HTML, but it holds true for many other Internet protocols.

If you're looking for some external reading, I found at least this:

* https://tools.ietf.org/html/draft-thomson-postel-was-wrong

I think you'll find few protocol designers arguing _for_ the robustness principle these days.

astrobe_
It goes against safety.

"Accept and correct" in the absence of ECC is just delusion if not hubris. The sender could be in a corrupted state and could have sent data it wasn't supposed to send. Or the data could have been corrupted during transfer, accidentally or deliberately. You can't know unless you have a second communication channel (usually an email to the author of the offending piece of software), and what you actually do is literally "guess" the data. How can it go wrong?

taeric
In the world of signed requests, bit flips are less of a concern. If the signature doesn't match, reject the call. Which is to say, I clearly don't mean accept literally everything. Just work within your confines and try to move the ball forward, if you can. This is especially true if you are near the user. Consider search engines with the "did you mean?" prompts. Not always correct, but a good feature when few results are found.

For system-to-system, things are obviously a bit different. Don't just guess at what was intended. But, ideally, if you take a date in, be like the GNU date utility and try to accept many formats (see the sketch below). But be clear in what you will return.

And, typically, have a defined behavior. That could be to crash. Doesn't have to be, though. Context of the system will be the guide.
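
A minimal sketch of that "accept several input formats, emit one canonical format" idea, in Python; the accepted formats here are just made-up examples:

    from datetime import datetime

    # Hypothetical set of input formats we choose to accept; output is always ISO 8601.
    ACCEPTED_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d %Y", "%Y-%m-%dT%H:%M:%S"]

    def parse_date(text):
        for fmt in ACCEPTED_FORMATS:
            try:
                return datetime.strptime(text.strip(), fmt)
            except ValueError:
                continue
        raise ValueError("unrecognised date: %r" % text)

    print(parse_date("2018-01-05").isoformat())   # 2018-01-05T00:00:00
    print(parse_date("05/01/2018").isoformat())   # 2018-01-05T00:00:00
    print(parse_date("Jan 05 2018").isoformat())  # 2018-01-05T00:00:00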

Boulth
A consumer is interested in fulfilling their need, so they will fix their request so that it gets processed.
digi_owl
The HTML5 clusterfuck comes from the biggest players being allowed to adjust the goal as they see fit, when they see fit (aka a "living document").
rhapsodic
> Just look at the clusterfuck that HTML5 has become. You need to have extremely deep pockets to enter that market.

What do you mean by "enter that market"?

hyperdimension
I think they mean needing deep pockets to write a new browser, with all the complexity that modern HTML+JS entails.
kbouck
Here are some thoughtful arguments against "Postel's Law":

https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...

cup-of-tea
I learnt that as the principle of robustness.
This seems like more of a problem with parsers not following the spec. It's a simple spec, but so strict and restrictive that it's a bit of a pain to hand to humans, and small extensions (like comments) are immensely useful for its (ab)uses as configuration "DSL"s. And some edge cases, like the string format not being fully specified, are (basically) fine; it's not a connection-negotiation protocol.

So JSON parsers tend to implement some weird, unspecified, inconsistent superset of JSON. I haven't encountered one yet that fails to parse valid JSON, though. That doesn't seem to imply that parsing JSON is a minefield, only that parsing human input, nicely, is a minefield. No spec avoids that human-ergonomics problem simply.

protomikron
In my opinion most parsers are just too lax. They support some fancy extension at the beginning (like comments, which are not a good idea), get more and more "feature-rich" and then support syntax that is not specified in the standard.

Other parsers now have to lower their "standard" (no pun intended) to compete, which leads to the kind of complex edge cases we also see with undefined behaviour in compilers.

E.g. if your HTML is broken, it mostly still renders somehow in your browser, which is in my opinion bad design; the same is probably true of JSON.

snowpanda
I totally agree. JSON to me is actually pretty straightforward; it's the parsers that interpret it differently.
teliskr
There are many things in tech which have caused great grief in my life as a programmer. JSON is not one of them.
hsivonen
This document is missing the XMLHttpRequest/Fetch JSON profile. ECMA operates on a sequence of Unicode code points (really code points, not scalar values!). WHATWG defines how you go from bytes over HTTP to something you can pass to the ECMA-specified JSON parser.
We had a nasty liberal-in-what-you-accept JSON problem: we were using JSON to communicate between services written in various languages (Python, Java, JavaScript, C++). The Python client was simply writing maps out, which were automatically serialized into something that was almost JSON: {'label': 123} (using ' to delimit the label strings, not "). The JavaScript JSON parser would silently accept this, as would some of the Java libraries, while both of the C++ parsers we used would reject it. This was a pain to debug, since some of the modules communicated seemingly perfectly, and of course those developers didn't see why they should change.
bastawhiz
JSON is mostly a strict subset of Python. That's not unexpected, but it makes me question how something like this actually happened. Your bug likely resulted from someone doing a `str(obj)` instead of `json.dumps(obj)`. Hardly the fault of JSON for being very similar to Python's default string serialization.
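
A quick Python illustration of how close, and yet not close enough, that is:

    >>> import json
    >>> str({'label': 123})
    "{'label': 123}"
    >>> json.dumps({'label': 123})
    '{"label": 123}'
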
That is precisely what they did, yes, and some of the client code (e.g. JavaScript) DWIMed it (accepted it).
It was likely that the client was not serialising the data at all, i.e. passing a dict straight to requests.post() as the body. That ends up being coerced to a string, hence the single quotes.
protomikron
> almost JSON: {'label': 123}

Almost is not enough. The parser of an API should refuse to transform this into an object. It may be valid Python, but I would blame the first parser (see the discussion on Postel's law here).

It's inevitable; nothing is perfect ... there are only popular things that everyone complains about and things that nobody cares about :D

We're not machines; we're more comfortable with messy and forgiving systems.

Do you want to build successful products? Be liberal on input and conservative on output. You need to reduce entropy and give people a feeling of magic. It's only when you're old and have enough scars on your skin that you learn to hate magic and become a control freak :D

keymone
i wish there were a chance for EDN[1] to replace JSON. it's a shame the industry defaulted to a subset of javascript as a data notation format, considering all its shortcomings =/

yeah, i get it, "but it has native support in all browsers" is a valid argument, i just wish it wasn't.

[1] https://github.com/edn-format/edn

In what regard is EDN "better" than JSON? The point of the post was that the RFC specification is not tight and that there are corner cases. I don't see any rigor in the given link either...
keymone
it is better in these regards:

- there exists a single reference implementation, which rules out points like scalars not being valid JSON in some parsers despite being part of the spec

- it is extensible in a manner that never invalidates the syntax for parsers that do not use corresponding extensions (this is huge actually)

- comments are part of the spec, so it essentially replaces both json and yaml

- richer set of primitive types

- commas are whitespace (best feature ever)

the rest should definitely be handled in a BNF spec, but the above makes EDN immediately much better than JSON

draegtun
It's an even bigger shame that Crockford wasn't able to go with Rebol (instead of "discovering" JSON), which he was originally pushing/planning :(
Walkman
If parsing JSON is a minefield, what about YAML? :D
A minefield where the mines hunt you down, rather than waiting for you to step near.
rurban
YAML is not a minefield, YAML is a joke.
rurban
Parsing JSON is not a minefield. It is technically trivial and pretty secure. Compared to other specs it's not that bad, but of course there are still some security concerns, esp. in the last two JSON RFC updates, which made it worse and not better.

But most other commonly used transport formats are much worse, and much harder to parse. Start reading at http://search.cpan.org/~rurban/Cpanel-JSON-XS-4.02/XS.pm#RFC...

rgovostes
I don't think it's a rule that JSON parsers are, in general, "pretty secure." Even if the parser itself is not vulnerable (to say, hitting recursion limits), how duplicate keys are handled between parsers has led to security vulnerabilities in the past for other things such as GET parameters. Or suppose an attacker gets a message through a few layers and that then causes a backend server to fail, like with the Swift errors he talks about, causing data loss.
ourcat
Try RSS.

Having built a system years ago to try and parse tens of thousands of feeds, there's a huge amount of 'fuzzy logic' required to put it all in order.

Despite the spec.

There were nine different and incompatible versions.

http://web.archive.org/web/2004/http://diveintomark.org/arch...

In a different realm, there are people (ourselves included) who find even ol' JSON annoyingly strict, and we prefer to just grab JSON5 instead (https://github.com/json5/json5), at least for system-local configs.

..there was a wise saying about how you gotta "stop worrying and love the bomb" ;)

rbalsdon
I'm happy to say that my own JSON parser (https://github.com/ryanbalsdon/cerial) passed a lot more of these tests than I expected! It isn't a general parser though (it requires a pre-defined schema), so it's probably cheating.
falcor84
I just noticed that the recursion depth test mentions 10000 opening brackets, while the test code uses `'['*100000` (one order of magnitude more). I am curious about the actual recursion depth they can handle, but I don't have access to Xcode myself.
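
For what it's worth, the same class of input can be tried from Python; in my experience CPython's json module guards its recursion and raises rather than crashing, but other parsers clearly differ:

    import json

    # 100000 unclosed brackets, mirroring the '['*100000 test case.
    # On CPython this is typically rejected with a RecursionError rather
    # than crashing the process; other parsers handle deep nesting differently.
    try:
        json.loads("[" * 100000)
    except RecursionError as err:
        print("refused:", err)
    except ValueError as err:
        print("rejected:", err)
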
Reminds me of some JSON I got from an API.

It was always malformed and I always wrote the dev that he should fix it.

He always did, but every new endpoint was malformed again.

One day I looked at the code and it was full of string concatenations of DB results...

  select '{ "user": { "name": "' || u.name || '", "email": ' || u.email || '" } }' as json from users u;
oh my.
Izkata
...I don't know if it was intentional or not, but you're missing a comma.
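
The fix, of course, is to build the structure and let a real serializer do the quoting and escaping. A minimal Python sketch, with the column names assumed:

    import json

    def row_to_json(row):
        # row is assumed to be a dict-like DB record with "name" and "email" columns
        return json.dumps({"user": {"name": row["name"], "email": row["email"]}})

    print(row_to_json({"name": 'Annie "Ann" Example', "email": "ann@example.com"}))
    # {"user": {"name": "Annie \"Ann\" Example", "email": "ann@example.com"}}
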
edejong
If you think parsing JSON is hard, try parsing/generating streaming JSON while limiting memory bandwidth requirements. Fun exercise, with a push-down automaton.
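
A toy sketch of the flavour of it in Python, nowhere near a full streaming parser: split a stream of concatenated top-level arrays/objects at their boundaries by tracking bracket depth and string state, so only the current value is ever buffered. The function name and chunking are made up.

    import json

    def iter_top_level_values(chunks):
        # Toy incremental splitter: track bracket depth plus in-string/escape
        # state so that brackets inside strings are ignored, and hand each
        # complete top-level array/object to json.loads. Only the current
        # value is buffered; bare scalars at top level are not handled.
        buf, depth, in_string, escaped = [], 0, False, False
        for chunk in chunks:
            for ch in chunk:
                buf.append(ch)
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                elif ch in "[{":
                    depth += 1
                elif ch in "]}":
                    depth -= 1
                    if depth == 0:
                        yield json.loads("".join(buf))
                        buf = []

    # Two documents split across arbitrary chunk boundaries:
    for value in iter_top_level_values(['[1, 2, ', '3]{"a": "]"}']):
        print(value)   # [1, 2, 3]  then  {'a': ']'}
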
Animats
Parsing UTF-8 in the presence of errors is a huge headache in itself.

UTF-8 with "byte order marks"? That makes no sense.

nickthemagicman
Why can't there be a subset of JSON or XML, called something like "strict mode", that's a much more sane version of the DDL?
habitue
Why can't there be?
drawkbox
Before JSON, XML and standard binary formats, there were just CSV/TSV and random binary formats, which was a bigger minefield. Simply exchanging data was a project in itself.

At least JSON and XML are text-based when it comes to data exchange. Back in the day, without JSON/XML, exchanging data between APIs was not only a minefield but one under constant carpet bombing. The fact that rarely-hit edge cases are all that is left of data-exchange issues is a huge advancement.

What is great is that in most cases JSON works fantastically and simplifies data exchange and APIs all the way from front end to back end. XML is available if needed, as are standard binary formats for compact, performance-critical areas like messaging, where humans may never see the payload and no third party needs to parse it. Parsing XML in client-side JavaScript is not fun, and neither is binary parsing, where adding a value can break the whole object; JSON keys can come and go.

The engineer can choose the tool for the job, but there had better be a good reason to use anything over simple JSON; almost any problem can be solved with it. Engineers should aim to take data complexity and make it as simple as possible, not take something simple and make it complex for job security. Real engineers move toward simplicity when possible and away from Vogon ways.

For data that is exchanged between services and front end/back end, JSON is the simplified format that makes things move faster. XML got tarred and devolved into Vogon sludge with SOAP services and nested namespacing/schemas, but it is still needed in some areas. Standard binary formats make sense when you control both sides, no one else needs to connect, the data never reaches the front end, or you need performant real-time messaging. There is also YAML if you need more typing, or BSON where binary is needed but you still want something simple. All formats have good and bad uses, but using binary when JSON will suffice is not being as simple as possible.

JSON is easy to get around and more lightweight: if you run into a problem you can just restructure your JSON to make it work, whereas binary or XML take more work to change without breaking things downstream, causing many more versions and conversions. JSON is a data-messaging format meant to simplify. Most of the issues in the OP could be worked around by storing the values in a string alongside a "type" or "info" key that allows conversion in the backend (e.g. long numbers or hex), or by storing binary as base64.

JSON is based on basic CS types: objects, lists, and simple data types like string, number and bool. That simplifies every system that serializes and deserializes to it. JSON helps spread simplicity while being dynamic.

JSON works best with the ever-changing, dynamic data/code/projects we build today; you can be consuming data from third-party APIs in seconds, faster and more simply than with any other format. That is why it won.

dboreham
You forgot BER ;)
And XDR, although in its case it is slightly unclear whether it is a standardized or a random binary format ;)
eadmund
> Before JSON, XML and standard binary formats, there were just CSV/TSV and random binary formats which was a bigger minefield.

S-expressions predate both, are simpler to parse than either, are more legible than both and are cheaper than either.

Here's a JSON example from http://json.org/example.html:

    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }
            }
        }
    }
In XML it'd be:

    <!DOCTYPE glossary PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
     <glossary><title>example glossary</title>
      <GlossDiv><title>S</title>
       <GlossList>
        <GlossEntry ID="SGML" SortAs="SGML">
         <GlossTerm>Standard Generalized Markup Language</GlossTerm>
         <Acronym>SGML</Acronym>
         <Abbrev>ISO 8879:1986</Abbrev>
         <GlossDef>
          <para>A meta-markup language, used to create markup
    languages such as DocBook.</para>
          <GlossSeeAlso OtherTerm="GML">
          <GlossSeeAlso OtherTerm="XML">
         </GlossDef>
         <GlossSee OtherTerm="markup">
        </GlossEntry>
       </GlossList>
      </GlossDiv>
     </glossary>
And as an S-expression it'd be:

    (glossary (title "example glossary")
              (div
               (title S)
               (list
                (entry (id SGML)
                       (sort-as SGML)
                       (term "Standard Generalized Markup Language")
                       (acronym SGML)
                       (def (para "A meta-markup language, used to create markup languages such as DocBook.")
                            (see-also GML XML))
                       (see markup)))))
Which is, I believe, a huge improvement.
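
To make "simpler to parse" slightly more concrete, here is a rough Python reader for the whitespace-and-parentheses subset used above; no escape sequences, no error handling, and not a full implementation of any S-expression spec:

    import re

    def read_sexpr(text):
        # Tokenize into parens, double-quoted strings and bare atoms,
        # then build nested lists of strings recursively.
        tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', text)
        pos = 0

        def parse():
            nonlocal pos
            token = tokens[pos]
            pos += 1
            if token == "(":
                items = []
                while tokens[pos] != ")":
                    items.append(parse())
                pos += 1  # consume ")"
                return items
            return token.strip('"')

        return parse()

    print(read_sexpr('(entry (id SGML) (see-also GML XML))'))
    # ['entry', ['id', 'SGML'], ['see-also', 'GML', 'XML']]
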
The S-expression has cleaner whitespace and field names than the JSON, which makes it harder to make an apples-to-apples comparison.

But the biggest problem with that S-expression is that I don't know how to parse it. Is SGML a symbol, identifier, a quoteless string? How do I know when parsing the 'entry' field that what follows is going to be a list of key/value pairs without parsing the whole expression? Is 'see-also GML XML' parsed as a list? How do we distinguish between single element lists and scalars? Is it possible to express a list at the top level, like JSON allows? How do you express a boolean, or null?

Of the problems outlined in the OP, S-expressions solve one: there's no question of how to parse trailing commas because there are no trailing commas. They do not solve questions of maximum levels of nesting. They have the same potential pitfalls with whitespace. They have exactly the same problems with parsing strings and numbers. They have the same problem with duplicated keys.

My point here isn't that you can't represent JSON as S-expressions. Clearly you can. My point is that in order to match what JSON can do, you have to create rules for interpreting the S-expressions, and those rules are the hard part. Those rules, in essence, _are_ JSON; once you've written the logic to serialize the various types supported by JSON to and from S-expressions, you've implemented "JSON with parentheses and without commas".

eadmund
> Is SGML a symbol, identifier, a quoteless string?

It's a sequence of bytes — a string, if you like.

> How do I know when parsing the 'entry' field that what follows is going to be a list of key/value pairs without parsing the whole expression?

You wouldn't, and as a parser you wouldn't need to. The thing which accepts the parsed lists of byte-sequences would need to know what to do with whatever it's given, but that's the same issue as is faced by something which accepts JSON.

> Is 'see-also GML XML' parsed as a list?

(see-also GML XML) is a list.

> How do we distinguish between single element lists and scalars?

'(single-element-list)' is a single-element list; 'scalar' is a scalar. Just like '["single-element-list"]' & '"scalar"' in JSON.

> Is it possible to express a list at the top level, like JSON allows?

That whole expression is a list at top level.

> How do you express a boolean, or null?

The same way that you represent a movie, a post or an integer: by applying some sort of meaning to a sequence of bytes.

> They do not solve questions of maximum levels of nesting.

They don't solve the problem of finite resources, no. It'll always be possible for someone to send more data than one can possibly process.

> They have the same potential pitfalls with whitespace.

No, they don't, because Ron Rivest's canonical S-expression spec indicates exactly what is & is not whitespace.

> They have exactly the same problems with parsing strings and numbers.

No they don't, because they don't really have either strings or numbers: they have lists and byte-sequences. Anything else is up to the application which uses them — just like any higher meaning of JSON is up to the application which uses it.

> They have the same problem with duplicated keys.

No, they don't — because they don't have keys.

> My point is that in order to match what JSON can do, you have to create rules for interpreting the S-expressions, and those rules are the hard part.

My point is that JSON doesn't — and can't — create all the necessary rules, and that trying to do so is a mistake, because applications do not have mutually-compatible interpretations of data. One application may treat JSON numbers as 64-bit integers, another as 32-bit floats. One application may need to hash objects cryptographically, and thus specify an ordering for object properties; another may not care. Every useful application will need to do more than just parse JSON into the equivalent data structure in memory: it needs to validate it & then work with it, which almost certainly means converting that JSON-like data structure into an application-specific data structure.

The key, IMHO, is to punt on specifying all of that for everyone for all time and instead to let each application specify its protocol as necessary. The reason to use S-expressions for that is that they are structured and capable of representing anything.

Ultimately, we can do more by doing less. JSON is seductive, but it'll ultimately leave one disappointed. It does a lot, but not enough. S-expressions do enough to let you do the rest.

I hope you understand that those questions were rhetorical -- they're questions that do not need to be asked about the equivalent JSON representation. Questions developers don't have to ask each other about the data they're sending each other.

The canonical S-expression representation solves some of the problems JSON has, true, but the example you provided is not a canonical S-expression. It wouldn't make sense for it to have been, because canonical S-expressions are a binary format and not comparable in this context to JSON or XML.

Application developers voted with their feet for serialization formats with native representations of common data types (strings, numbers, lists, maps, booleans, null). There's a lot of reasons that JSON has supplanted XML, but one of them is that JSON has these types built in and XML does not. A lot of real-world data interchange and storage can make good use of those primitives. Many problems boil down to "how do I pass around a list of key/value records". There is a lot to say for not having to renegotiate that kind of basic detail every time two applications need to communicate.

You can represent S-expressions as JSON strings and arrays. I've done it. It was the best way to represent the data I was trying to store, but that's because the data was already represented as S-expressions. I've never seen anyone else do it, and that doesn't surprise me. For most purposes JSON is used for, it is more useful than S-expressions -- not necessarily more powerful, but more useful.

Interpreted as a Rivest S-expression, the example given above conforms to the "advanced transport representation" [1], and so can automatically and straightforwardly be converted to the "canonical representation" [2].
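
As a concrete example of what that conversion produces, take one of the sub-lists above (this is my reading of the draft, not a quote from it):

    advanced:  (see-also GML XML)
    canonical: (8:see-also3:GML3:XML)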

In an important sense, then, I'd claim that it is a "canonical S-expression".

The reason this works is because SPKI S-expressions aren't just a grammar for a syntax, they also come with [3] a total /equivalence relation/, which is exactly what JSON lacks and which is what makes JSON such a pain to work with.

In other words, SPKI S-expressions have a semantics. JSON doesn't.

Lots of other "modern" data languages also lack equivalence relations, making them similarly difficult to use at scale.

[ETA: Of course, your point about lacking common data types is a good one! My fantasy-land ideal data language would be something drawing from both SPKI S-expressions and BitTorrent's "bencoding", which includes integers and hashes as well as binary blobs and lists.]

---

[1] Section 6.3 of http://people.csail.mit.edu/rivest/Sexp.txt

[2] Section 6.1 of http://people.csail.mit.edu/rivest/Sexp.txt

[3] The SPKI S-expression definition is still a draft and suffers a few obvious problems - ASCII-centrism and the notion of a "default MIME type" being two major deficits. Still, I'd love to see the document revived, updated, and completed. Simply having an equivalence relation already lifts it head and shoulders above many competing data languages.

drawkbox
S-expressions are better than binary for sure, but you end up having to write and maintain parsers for the front end, the back end and more. S-expressions influenced the creation of HTML/XML. With anything that isn't JSON/XML you end up with formats that don't have massive support on the client and server side and that take more work to serialize/deserialize to and from. The same goes for YAML; formats with more typing and rules are not as simple and do add some complexity.

The big reason that JSON and even XML were so successful is that parsing them from front end to back end, and using them directly in JavaScript and APIs, is such a simple step. JSON is easier than XML, and XML is easier than binary and other formats with more requirements, rules and complexity.

The basic types, ease of nesting and readability of both JSON and even XML influenced the systems that serialize and deserialize them to be simpler as well.

Exchanging data with CSV/XLS/binary/BER/DER/ASN.1 etc. has more mines in the field than JSON, and XML has more than JSON. JSON's killer feature is simplicity; it forces you into simpler input/output.

Simplicity is always good when it comes to exchanging data.

tannhaeuser
Just wanted to share that you can technically parse S-expressions using SGML.

[1]: https://web.archive.org/web/19991008044801/http://www.blnz.c...

pedrorijo91
I'm always saying that I can't understand how we have a new hipster programming language/framework every year, yet we still struggle with parsing JSON.
fwdpropaganda
Damn, the people that work on these kinds of things are heroes.
Froyoh
Gson is the best one out there
threepipeproblm
TOML is supposed to be easy to parse.
your-nanny
Doing God's work, man. Helluva job
