JasonFruit
This is interesting and important in one way: anything poorly specified will eventually cause a problem for someone, somewhere. That being said, my first response was to complete the title: ". . . yet it remains useful and nearly trouble-free in practice." There's a lot of "You know what I mean!" in the JSON definition, but in most cases, we really do know what Crockford means.

Someone
If your API takes json input, some of those issues are potential security or DoS issues.

For example, if you validate your json in your web front-end (EDIT: I used the wrong term. What I meant here is the server-side process that’s in front of your database) and then pass the string received to your json-aware database, you’re likely using two json implementations that may have different ideas about what constitutes valid json.

For example, a caller might pass in a dictionary with duplicate key names, and the two parsers might each drop a different one, or one might see json where the other sees a comment.
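
For illustration, a quick Python sketch of how that plays out (the payload here is made up): the standard library's json module silently keeps the last value for a repeated key, while an object_pairs_hook can detect the duplication that another parser in the chain might resolve differently.

  import json

  payload = '{"role": "user", "role": "admin"}'

  # Python's json module keeps the *last* value for a duplicate key;
  # another parser in the chain might keep the first, or reject the document.
  print(json.loads(payload))  # {'role': 'admin'}

  # object_pairs_hook sees every key/value pair, so duplicates can be
  # detected and rejected before the string travels any further.
  def reject_duplicates(pairs):
      obj = {}
      for key, value in pairs:
          if key in obj:
              raise ValueError(f"duplicate key: {key!r}")
          obj[key] = value
      return obj

  json.loads(payload, object_pairs_hook=reject_duplicates)  # raises ValueError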

helaan
Reminds me of last year's CouchDB bug (CVE-2017-12635), which was caused by two JSON parsers disagreeing on duplicate keys: it was possible to add a second key with user roles, allowing a user to give admin rights to itself. JSON parser issues are real.
xenadu02
One of the benefits of serialization technology (like Codable+JSONEncoder in Swift or DataContract in C#) is that you get a canonical representation of the bits in memory before you pass the document on to anyone else.

By representing fields with enums or proper types you get some constraints on values as well, e.g. if a value is really an integer field then your type can declare it as Int and deserialization will smash it into that shape or throw an error, but you don't end up with indeterminate or nonsense values.

This can be even more important for UUIDs, Dates, and other extremely common types that have no native JSON representation, nor even any agreed-upon consensus around them.

You get less help from the language with dynamic languages like Python but you can certainly accomplish the same thing with some minimal extra work. Or perhaps it would be more accurate to say languages like Python offer easy shortcuts that you shouldn't take.
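
For the dynamic-language case, a minimal Python sketch (Account, retries and the other field names are invented for illustration): force parsed values into declared types on the way in, rather than trusting whatever shape the parser produced.

  import json
  from dataclasses import dataclass
  from enum import Enum
  from uuid import UUID


  class Status(Enum):
      ACTIVE = "active"
      DISABLED = "disabled"


  @dataclass(frozen=True)
  class Account:
      id: UUID        # JSON has no UUID type, so parse the string explicitly
      retries: int    # must be a real integer, not 3.0 or "3"
      status: Status  # constrained to the enum's values

      @classmethod
      def from_json(cls, raw: str) -> "Account":
          doc = json.loads(raw)
          retries = doc["retries"]
          # bool is a subclass of int in Python, so exclude it explicitly
          if not isinstance(retries, int) or isinstance(retries, bool):
              raise TypeError("retries must be an integer")
          return cls(id=UUID(doc["id"]),
                     retries=retries,
                     status=Status(doc["status"]))


  acct = Account.from_json(
      '{"id": "8f14e45f-ea8b-4d2c-9a6b-1c2d3e4f5a6b", "retries": 3, "status": "active"}'
  )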

In any case I highly recommend this technique for enforcing basic sanitization of data. The other is to use fuzzing (AFL or libFuzzer).

SOLAR_FIELDS
This specific RCE vulnerability was actually given as an explicit example of the consequences of the current state of the specifications.
mjevans
Normalize before approval and add filters that only allow in /expressly approved/ items from insecure environments.
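
A minimal sketch of that idea, assuming a hypothetical allowlist of approved fields: build a fresh object containing only expressly approved keys and forward that, never the caller's original document.

  import json

  # Hypothetical allowlist: approved field name -> expected type.
  APPROVED_FIELDS = {"command": str, "count": int}

  def normalize(raw: str) -> str:
      doc = json.loads(raw)
      clean = {key: doc[key]
               for key, expected in APPROVED_FIELDS.items()
               if key in doc and isinstance(doc[key], expected)}
      # Anything not expressly approved is dropped; what goes downstream is a
      # freshly serialized copy, not the bytes that arrived from the insecure side.
      return json.dumps(clean)
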
paradite
I think it is rather common sense to do data validation on the backend instead of the frontend. What matters is that the backend always acts as the source of truth; it doesn't really matter if the frontend and backend are inconsistent, as long as we know that the backend data is correct.
guntars
I sure hope you don't just put random user-provided blobs in your database, even if they're validated. Also, how do you validate without parsing? If it's parsed, might as well serialize again when saving to the DB.
Someone
"If it's parsed, might as well serialize again when saving to the DB"

You didn’t grow up in the 1980’s, I guess :-)

Why spend cycles serializing again if you already have that string?

toast0
Because experience has shown us that today's parsers don't detect tomorrow's 0-day parsing bugs, but serializing a clean version of what was parsed is more likely to be safe (see the many jpeg, mpeg, etc. exploits).
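
In its simplest form that is just a decode/encode round trip, sketched below; as the next comment points out, this alone is not always enough.

  import json

  def clean_copy(raw: str) -> str:
      # Forward what *our* parser understood, not the bytes the caller sent.
      return json.dumps(json.loads(raw))
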
Someone
More likely, yes, but it need not help you here. Let’s say Chuck sends

  {"command":"feed", "command":"kill"}
Alice uses json parser #1. It keeps both “command” entries.

Alice next checks the “command” value against a whitelist. Her json library reads the first value, returning the benign “feed”.

Alice next serializes the parsed structure and sends it to Bob. The serializer she uses returns the exact string Chuck sent.

Bob, using a different json parser, parses the json. That parser drops the first “command”, so he gets the equivalent of

  {"command":"kill"}
Since Bob trusts Alice, he executes that command.

What would help here is if Alice generated a clean copy of what she thinks she received, and serialized that. For more complex APIs, that would mean she has to know the exact API that Bob expects, though. That may mean extra work keeping Alice's knowledge of the ins and outs of the API up to date as Bob's API evolves.
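
A sketch of that clean-copy idea in Python, with the command whitelist invented for illustration: reject duplicate keys at parse time, validate the one field the API defines, and serialize only the fields that were actually checked.

  import json

  ALLOWED_COMMANDS = {"feed", "pet", "groom"}  # hypothetical whitelist

  def reject_duplicates(pairs):
      obj = {}
      for key, value in pairs:
          if key in obj:
              raise ValueError(f"duplicate key: {key!r}")
          obj[key] = value
      return obj

  def forward_to_bob(raw: str) -> str:
      doc = json.loads(raw, object_pairs_hook=reject_duplicates)
      command = doc["command"]
      if command not in ALLOWED_COMMANDS:
          raise ValueError(f"command not allowed: {command!r}")
      # Bob receives a document Alice built herself, containing exactly the
      # fields she validated, never Chuck's original string.
      return json.dumps({"command": command})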

If it's parsed, why even store it in a database as JSON at all?

If you don't do that... then multiple possible JSON parsers aren't a problem.

Mikhail_Edoshin
Use case: I sync local data with a web API. I do not use all the data I receive, only a few bits, but if I modify them, I have to send a complete object back to the server with all the other data. The simplest way to do this is to store the original JSON.

CardDAV and CalDAV are not JSON, but their specifications also require you to preserve the whole vCard if you ever want to send your changes back to the server. CardDAV data may be accessed by multiple apps, and they are allowed to add their own private properties; any app that deals with vCards must preserve all properties, including those it doesn't understand or use.
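
For the JSON sync case in the first paragraph, a small sketch of that pattern (field names invented): parse the stored JSON, change only the fields this app actually uses, and write the whole object back so properties it never touched survive the round trip.

  import json

  stored = '{"summary": "Lunch", "x-other-app-color": "#ff0000", "sequence": 3}'

  doc = json.loads(stored)
  doc["summary"] = "Lunch with Alice"  # the one field this app edits
  doc["sequence"] += 1

  # Everything else, including properties added by other apps, is preserved.
  payload = json.dumps(doc)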

couchand
One common reason is to provide a flexible table for things that may not have an identical schema. For instance, an event log might have details about each event that differ based on the event type.
leothelocust
Fully agree. You can cause parsing issues, but you can also... not.

If you are creating a JSON response from your own API, you control the JSON output.

Unless you are crafting JSON from scratch, I doubt anyone runs into the issues mentioned in the OP.

bufferoverflow
> anything poorly specified

I thought JSON was specified quite clearly.

http://json.org/

There are no limits on the loopy things (the number of consecutive digits in numbers), but I don't consider that a weakness of the standard.

Most of the tests that I see pass completely invalid JSON to the parsers.

http://seriot.ch/json/pruned_results.png

spc476
So, 9223372036854775807 is a valid number per the json.org spec, but good luck getting a typical JSON decoder to process that number. A couple I tried returned it as 9.2233720368548e+18, which is not the same number.

Isn't that a limitation of the language and API rather than the parser/decoder? I would guess that most users don't want a JSON decoder that depends on some library for arbitrary-precision numbers and returns such numbers for the user's inconvenience.
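
For illustration, Python's own json module keeps the exact integer because Python ints are unbounded, but a parse_int hook can simulate a decoder that stores every number as a 64-bit float:

  import json

  raw = "9223372036854775807"

  print(json.loads(raw))                        # 9223372036854775807 (exact)

  # A decoder that maps every JSON number onto a double cannot represent it.
  print(json.loads(raw, parse_int=float))       # 9.223372036854776e+18
  print(int(json.loads(raw, parse_int=float)))  # 9223372036854775808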

The summary table suggests that a real bug was found in about half of the parsers tested, and even a few of those bugs belong to a category that one might choose to ignore: almost any non-trivial function can be made to run out of memory, and lots of functions will crash ungracefully when that happens rather than correctly freeing all the allocated memory and returning an error code. The caller probably doesn't handle that error correctly anyway, because the case was never tested.
