SamReidHughes
No, this is not true of many reasonable formats. You don't have to make an obtusely nontrivial format to encode the data JSON does.

arghwhat
JSON is fairly trivial. The post is a nonsensical rant about parsers accepting non-JSON-compliant documents (which the JSON spec specifically states parsers may do), such as trailing commas.

In the large colored matrix, the following colors mean everything is fine: green, yellow, light blue, and deep blue.

Red marks crashes (things like 10,000 nested arrays causing a stack overflow, which is a parser bug rather than anything JSON-specific), and dark brown marks constructs that should have been supported but weren't (things like UTF-8 handling, again parser bugs rather than anything JSON-specific).

Writing parsers can be tricky, but JSON is certainly not a hard format to parse.
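
For concreteness, here is a minimal sketch of that red-cell failure mode using Python's standard json module (the depth is illustrative; the exact point at which a parser gives up varies with interpreter and stack configuration):

    import json

    depth = 100_000
    doc = "[" * depth + "]" * depth  # a single, deeply nested, valid JSON value

    try:
        json.loads(doc)
    except RecursionError:
        # The parser recurses once per nesting level, so a valid but
        # deeply nested document exhausts the recursion limit.
        print("parser gave up: recursion limit exceeded")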

SamReidHughes OP
As a question of fact, programs put out JSON that gets misparsed by other programs. Some parse floating-point values differently, others treat Unicode strings incorrectly or output them incorrectly. Different parsers have different opinions about what a document represents. This has real-world impact.

Accepting invalid or ambiguous or undefined JSON is not acceptable behavior. It means bugs get swallowed up and you can't reliably round-trip data.
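
A concrete case of the kind of divergence I mean, with integers beyond 2^53 (Python shown; the JavaScript result follows from JSON.parse mapping every number to an IEEE-754 double):

    import json

    doc = '{"id": 9007199254740993}'  # 2**53 + 1
    print(json.loads(doc)["id"])      # Python: 9007199254740993, exact
    # JavaScript's JSON.parse reads the same document as 9007199254740992,
    # so the two sides silently disagree about the value of "id".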

repsilat
> Accepting invalid or ambiguous or undefined JSON is not an acceptable behavior

Just to make it explicit (and without inserting any personal judgement into the conversation myself): JSON parsers should reject things like trailing commas after final array elements because accepting them would encourage people to emit them?

Having asked the question (and now explicitly freeing myself to talk values): the argument is new to me -- a solid and rare objection to the Robustness Principle. Maybe it's common enough in these sorts of discussions, though? Anyway, partial as I might be to trailing commas, I do quite like the "JSON's universality shall not be compromised" argument.

SamReidHughes OP
Postel's Law or the "Robustness Principle" is an anti-pattern in general.

Accepting trailing commas in JSON isn't as big a deal as having two different opinions about what a valid document is. But you might think a trailing comma could indicate a hand-edited document that's missing an important field or array element.
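
For concreteness, a strict parser refuses the construct outright. Python's standard json module, for example:

    import json

    try:
        json.loads("[1, 2, 3,]")  # trailing comma: invalid per RFC 8259
    except json.JSONDecodeError as err:
        print(f"rejected: {err}")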

arghwhat
Unicode and floating-point misparsing are not even remotely JSON-related; they are simply bugs that can occur in any parser that handles Unicode or floating point. Thus, complaining about them in a "JSON is a minefield" thread is a bit silly.

If you put out JSON that gets misparsed, either you generated invalid JSON or the parser is faulty. No way around that.

This has nothing to do with whether parsers have flexibility to accept additional constructs, which is extremely common for a parser to do.

Annatar
Actually, unless one is doing JavaScript, JSON is extremely difficult to parse correctly. I challenge you to write a simple, understandable JSON parser in Bourne shell or in AWK.

repsilat
> JSON is extremely difficult to parse correctly ... in Bourne shell or in AWK.

Sorry for the misquote, but does it get to the heart of your objection?

I'm torn here. On the one hand I want to say "Those are not languages one typically writes parsers in," but that's a really muddled argument:

1. People "parse" things often in bash/awk because they have to -- because bash etc deal in unstructured data.

2. Maybe "reasonable" languages should be trivially parseable so we can do it in Bash (etc).

I'm kinda torn. On the one hand bash is unreasonably effective, on the other I want data types to be opaque so people don't even try to parse them... would love to hear arguments though.

arghwhat
You definitely shouldn't write a parser for anything in bash.

If you want to deal with JSON, I'd recommend jq as an easy manipulation tool (e.g. jq '.users[0].name' data.json to pull one field out of a document).

Annatar
I wrote “Bourne shell” and both of you assumed bash. Horrid state of affairs in today’s IT. All hail Linux hegemony!

And as for AWK: if data isn’t easily parsable with the language specifically designed for parsing data, something is wrong with that data.

Annatar
Shell and AWK programs have the fewest dependencies, are extremely light on resource consumption, and are extremely fast. When someone’s program emits JSON because the author assumed that programs which ingest that data will also be JavaScript programs, that’s a really bad assumption. It would force the consumer of that data to replicate the environment. This goes against core UNIX principles, as discussed at length in the book “The Art of UNIX Programming”. It’s a rookie mistake to make.

> It would force the consumer of that data to replicate the environment.

It won't, because JSON is a standard. Imperfect like all standards, but practically good enough. And "plain text" just means "an undefined syntax that I have to mostly guess". And nobody "programs" in bash or awk anymore. The "standard scripting languages" for all sane devs are Python or Ruby (and some legacy Perl), and parsing JSON in them is trivial.

The "UNIX philosophy" was a cancerous bad idea anyway and now it's thankfully being eaten alive by its own children, so time to... rejoice?!

EDIT+: Also, if I feel truly lazy/evil (as in "something I'll only use in this JS project"), I would use something much, much less standard than JSON, like JSON5 (https://github.com/json5/json5), which will practically force all consumers to use JS :P

arghwhat
Why in the world would I write a parser in bash or awk, regardless of the format? I certainly have better things to do. It doesn't matter how sensible a format is, those tools are simply not appropriate to write a parser in.

I have, however, written a JSON parser before at a previous company in C++ (they didn't want to pull in a library). It wasn't particularly hard.

And yes, like any other parser, it accepted additional non-JSON constructs. This was simply because it would take additional work to error out on those constructs, which would be a waste of time.

Annatar
Why in the world would I have to use JavaScript or any object-oriented programming language just because the application is poorly thought out and emits JSON? I certainly have better things to do.

> It doesn't matter how sensible a format is, those tools are simply not appropriate to write a parser in.

AWK is a language designed for data parsing and processing. That is what it is designed to do.

How did you solve the parsing of arbitrarily nested structures?

jhomedall
> How did you solve the parsing of arbitrarily nested structures?

Write a recursive descent parser, or use a parser generator.

Or, more realistically, use one of the many libraries available for parsing it (pretty much every language has one at this point).
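
If it helps, here's a toy sketch of the recursive descent shape in Python. It is deliberately incomplete (no escape sequences, no strict number grammar, no real error reporting); the point is only that arbitrary nesting falls out of the recursion for free:

    def skip_ws(s, i):
        # Advance past insignificant whitespace.
        while i < len(s) and s[i] in " \t\r\n":
            i += 1
        return i

    def parse_value(s, i):
        c = s[i]
        if c == "[":
            return parse_array(s, i)
        if c == "{":
            return parse_object(s, i)
        if c == '"':
            end = s.index('"', i + 1)  # naive: ignores escape sequences
            return s[i + 1:end], end + 1
        for literal, val in (("true", True), ("false", False), ("null", None)):
            if s.startswith(literal, i):
                return val, i + len(literal)
        j = i
        while j < len(s) and (s[j].isdigit() or s[j] in "+-.eE"):
            j += 1
        return float(s[i:j]), j  # toy: every number becomes a float

    def parse_array(s, i):
        # s[i] is "["; recursion handles any nesting depth (up to the stack).
        items, i = [], skip_ws(s, i + 1)
        while s[i] != "]":
            value, i = parse_value(s, i)
            items.append(value)
            i = skip_ws(s, i)
            if s[i] == ",":
                i = skip_ws(s, i + 1)
        return items, i + 1

    def parse_object(s, i):
        # s[i] is "{"; keys are parsed as ordinary string values.
        obj, i = {}, skip_ws(s, i + 1)
        while s[i] != "}":
            key, i = parse_value(s, i)
            i = skip_ws(s, i)
            assert s[i] == ":", "expected ':' after object key"
            value, i = parse_value(s, skip_ws(s, i + 1))
            obj[key] = value
            i = skip_ws(s, i)
            if s[i] == ",":
                i = skip_ws(s, i + 1)
        return obj, i + 1

    print(parse_value('{"a": [1, [2, [3]]], "ok": true}', 0)[0])
    # -> {'a': [1.0, [2.0, [3.0]]], 'ok': True}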

rarepostinlurkr
JSON is clearly data.

If awk cannot parse it, perhaps it’s a shortcoming of awk? =)

Annatar
It’s not an AWK limitation; recursive descent parsers are non-trivial to implement and tricky to get right.

IshKebab
Most of the things they test are true of any text-based format, and many of them are true of any serialisation format, e.g. 100,000 opening brackets. You could do the same in XML, for example, and I expect many parsers would fail.

jandrese
Maybe the difference is that nobody ever thought that XML was easy to parse.

SamReidHughes OP
XML is far worse than JSON, it is known. It's a lot easier to screw up a text-based format than a binary format. But it would also be possible to make a better text-based format than JSON, mainly by pinning down exactly which numeric values can be represented and exactly how they get parsed, and by making it harder to screw up strings. That's where most of the problems are.
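
One concrete string pitfall: the JSON grammar syntactically permits a lone UTF-16 surrogate escape, and what parsers do with it varies. Python's json module, for instance, accepts it and hands back a string that can't survive a round trip through UTF-8:

    import json

    s = json.loads('"\\ud800"')  # a lone high surrogate, syntactically legal
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as err:
        print(f"round trip fails: {err}")
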
jandrese
I think the reason JSON doesn't take well to strong typing is that it was designed for JavaScript. When you pass some JavaScript object to the JSON encoder, how is it supposed to decide which numeric representation to use? Does it always go for the smallest one that works? The JavaScript decoder at the other end is just going to throw away that type information anyway, so all it is good for is verifying the data while you are parsing it. Maybe not a totally useless thing to do, but it's a lot of work for that modest benefit.

SamReidHughes OP
Making a binary encoding tailored to JavaScript documents is easy. And I don't think strong typing has anything to do with it. Making it handle Python and Ruby objects at the same time is harder, because they have different opinions about what a string is and what a number can be.
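
A small illustration of the number disagreement, assuming Python on one end (JavaScript's JSON.parse maps both documents below to the same double, so the distinction doesn't survive the trip):

    import json

    a = json.loads("1")    # Python: int 1
    b = json.loads("1.0")  # Python: float 1.0
    print(type(a).__name__, type(b).__name__)  # -> int float
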
> It's a lot easier to screw up a text-based format than a binary format.

I don't see a single instance where this is true.

always_good
I notice that in a topic where it'd be so easy and even necessary to rattle off a few names/examples, you've chosen not to do it.
