

rendaw
The primitive types JSON specifies are redundant and generally just lead to issues. Almost all JSON consumers either deserialize to a spec that already contains type information (frequently richer, with a wider variety of types: URL, telephone number, UUID, not just "string"), or, even without a spec, are written to expect a specific type (i.e. you're not going to write code that accepts an integer when you want a person's name).

It would be much simpler if all primitives were strings, and it'd probably save a few people from accidentally doing the wrong thing while dealing with prices.
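For illustration, a minimal Python sketch of the price hazard (the field names are made up):

  import json
  from decimal import Decimal

  # Prices as JSON numbers: most parsers hand back binary64 floats.
  floats = json.loads('{"a": 0.10, "b": 0.20}')
  print(floats["a"] + floats["b"])                      # 0.30000000000000004

  # Prices as strings: the application parses them into an exact type.
  strings = json.loads('{"a": "0.10", "b": "0.20"}')
  print(Decimal(strings["a"]) + Decimal(strings["b"]))  # 0.30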

aftbit
Perhaps. I've often wished that JSON supported some sort of custom types or type annotations, or failing that, at least datetimes. Some other nice extensions would be support for comments and optional trailing commas.

There is something very nice and expressive about the existing JSON types. Just 6 types (null, boolean, string, number, array, and dictionary) are enough to cover a ton of use cases, and as you suggest, one can always fall back to "stringly typed" alternatives by implementing one's own serialization and deserialization for extra types.

ooterness
You may be interested in CBOR (IETF RFC 8949).

CBOR's features map almost one-to-one onto JSON's, except that the encoding is more size-efficient, it supports a few additional types (e.g., integers and floats are distinct), and it allows semantic tags.

https://en.wikipedia.org/wiki/CBOR

zzo38computer
There are some benefits to CBOR (a separate integer type is good, a byte string type is good, and the typed numeric arrays are good too, etc.), but also some problems. For example, I might have preferred that Unicode be a tag rather than a type (other tags could then cover other character sets); the base64-encoded string tags also seem unnecessary (since it is a binary format anyway, you should just use the binary data directly); I think a MIME message would be better treated as a byte string instead of Unicode (fortunately the specification allows that, though it seems to have been added on afterward due to a lack of consideration); and it might be better to disallow arrays and maps as key types.

However, some of the things I mentioned above do have benefits for interoperability with JSON, even though they aren't good for general-purpose use; I think it is generally better to make a good format than to work around the bad ideas of other specifications. (Fortunately, what I described above could be implemented as a subset of CBOR.)

That said, using these formats (whether CBOR or JSON) is often more complicated than a specific use actually requires.

murmansk
While it might be great in theory, CBOR has its own set of dragons waiting for you.

Expectation: tags in CBOR let you convey semantics. Reality: a multitude of tags and an absence of strict rules for them make it a pain in the ass.

kibwen
Let's make a distinction here between serialization formats and configuration formats. Because JSON is often used for both, these two use cases often get conflated.

For configuration formats, I 100% agree with you. I do not want any data type except a string and a hashmap (maybe an array if you're being luxurious). Not an int, not a float, not a boolean, not a datetime (looking at you, TOML). For configuration formats I am always immediately feeding those files into a language with a richer type system that will actually parse them; my program and its embedded types are the schema. (Users of dynamically-typed languages may reasonably disagree.)

However, for the serialization use case, I'm not so sure. There's an argument that having a schema against which to do lightweight validation at several points in the pipeline isn't the worst idea, and built-in primitives get you halfway to a half-decent schema. I'm ambivalent at worst.

troupo
> my program and its embedded types are the schema.

They are not. Configuration is a very tiny subset of a more general problem that you also mention: serialization.

Your config file will be de-serialized by your program and parsed into some specific types. Including numbers (tons of edge cases), dates (tons of edge cases), strings (tons of edge cases) etc.

It becomes worse when your program is used by more people than just you: which field is a date? In which format? Do you handle floats? What precision? What's the decimal separator? Do you do string normalization? What are valid and invalid characters, if any?

You can't pretend that your config is "just strings". It isn't.

mike_hock
I kind of took away the opposite from the parent post. Of course, your config isn't just strings, but it also isn't just a limited set of primitive types that the inventor of some one-size-fits-all configuration language envisioned.

You can't build a generic schema validator that will accept exactly the valid configs for some program and nothing else anyway, so forget the half-assed type checking attempts and just provide the hierarchical structure. It's up to the application to define the valid grammar and semantics of each config option and parse it into an application-specific type.
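A hedged sketch of that division of labor in Python (all option names and formats here are hypothetical):

  from datetime import date

  # What a strings-only config format hands the application:
  raw = {"port": "8080", "deadline": "2024-05-01", "verbose": "yes"}

  # The application is the schema: it decides each option's type,
  # accepted format, and error handling.
  config = {
      "port": int(raw["port"]),
      "deadline": date.fromisoformat(raw["deadline"]),
      "verbose": raw["verbose"].lower() in ("1", "true", "yes", "y"),
  }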

troupo
That's why every time I run into a program-specific config I curse the developer because there's no way of knowing what exactly a particular program (or a framework) needs :)
But most configs are just strings, and that's okay. How did it get so bad in this thread?

Human input is full of tradeoffs; that’s why it’s bash and not typescript in your shell path column. And you'll meet great resistance from users if you make your config fully typed and require it to refer to a schema, DTD, namespace, or whatever BS XML had.

troupo
> that’s why it’s bash and not typescript in your shell path column

Bash is there purely for historical reasons. And it sucks.

> And you'll meet great resistance from users if you make your config fully typed and require it to refer to a schema, DTD, namespace, or whatever BS XML had.

That schema can and will help editors validate and autocomplete things on the fly, and it can also serve as a reference for what data the config actually accepts.
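The same idea applies to JSON configs; a minimal sketch assuming the third-party jsonschema package (the schema itself is invented):

  import jsonschema

  schema = {
      "type": "object",
      "properties": {
          "port": {"type": "integer", "minimum": 1, "maximum": 65535},
          "host": {"type": "string"},
      },
      "required": ["port"],
  }

  # Raises jsonschema.ValidationError: "8080" is a string, not an integer.
  jsonschema.validate({"port": "8080"}, schema)

An editor can consume the same schema to flag that mistake while you type.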

hgyjnbdet
I would say all configs should be treated as castable strings. That's why for config files I much prefer the INI format.
nevermore24
The strings are strings. I don't care how people handle their dates, that's between them and their god.
crazygringo
> Almost all JSON consumers either deserialize to a spec that already contains type information

But different languages interpret different strings in different ways by default.

This leads to major bugs.

One of the great strengths of JSON is that parsing a number is well-defined.

The way you're suggesting would lead to people emitting JSON with leading zeros sometimes, and then some languages end up interpreting certain numbers as octal.

No thank you.
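To make the octal hazard concrete, a small Python illustration (C's strtol with base 0 and pre-ES5 JavaScript parseInt really do treat a leading zero as octal):

  # A strings-only protocol leaves the base up to each consumer:
  int("010")      # 10 -- Python assumes decimal
  int("010", 8)   # 8  -- a consumer that infers octal from the leading zero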

anonymoushn
JSON numbers are just certain strings, but some tools that deal with JSON, such as jq, feel the need to mangle the numbers anyway.
crazygringo
I don't know what you mean.

JSON numbers are far more restrictive than strings and carry precisely defined meaning in a way that arbitrary strings don't. They're only "just certain strings" in the same way anything can be serialized to a string, which doesn't really mean anything.

What does jq do to them?

anonymoushn
It replaces them with different numbers, even if you don't try to do math on them :)

  echo 1.4e99999999999999 | jq
  1.7976931348623157e+308
While I agree that the meaning of JSON numbers exists, I'm not sure which JSON standard you're referring to that contains this meaning. json.org certainly does not contain it, and it links to ECMA-404, which just says "JSON is agnostic about the semantics of numbers."
wwader
jq tries to preserve number precision if you don't do operations on them, though as you noted, only within limits. If you do operations, the involved numbers are first converted to binary64 (a.k.a. double), the same as Node and most other languages. This is what RFC 7159 recommends for interoperability.
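Python's stdlib behaves the same way by default, which is easy to verify:

  import json

  # Parsed as binary64, so out-of-range values overflow to infinity:
  json.loads("1.4e99999999999999")      # inf

  # And in-range values are rounded to the nearest double:
  json.loads("0.1000000000000000055")   # 0.1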
Disagree. The typical ad hoc functions for parsing strings to bools make me despair (uppercase, lowercase, true, yes, y, 1, ...).
sgarland
Python’s distutils had a strtobool() function that was very handy for this, but the module has been removed. It’s trivial to re-implement, but still slightly annoying to have to do.
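For reference, a sketch of it from memory (the original returned 1/0 rather than True/False):

  def strtobool(val: str) -> bool:
      # Re-implementation of distutils.util.strtobool, removed along
      # with the rest of distutils in Python 3.12.
      val = val.lower()
      if val in ("y", "yes", "t", "true", "on", "1"):
          return True
      if val in ("n", "no", "f", "false", "off", "0"):
          return False
      raise ValueError(f"invalid truth value {val!r}")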
IshKebab
It's extremely common in dynamically typed languages to deserialise JSON without a spec. What you're asking for is basically XML, and it's definitely nicer to get at least basic types (string, bool, int, etc.) "for free".
kemitchell
I've long used a toylike "Lists and Maps of Strings" format for personal recordkeeping and automation. https://www.npmjs.com/package/lamos

I've never gone back to formalize the grammar or otherwise mature it. But it's served me well as-is, and it's been easy to convert "up" to JSON or YAML or XML or what-have-you, once the case for an interface beyond plain text proves worthwhile.

fuzztester
>It would be much simpler if all primitives were strings,

TCLON?

jsnell
Thanks! Macroexpanded:

Parsing JSON Is a Minefield (2016) - https://www.hackerneue.com/item?id=28826600 - Oct 2021 (173 comments)

Parsing JSON Is a Minefield (2018) - https://www.hackerneue.com/item?id=20724672 - Aug 2019 (178 comments)

Parsing JSON is a Minefield - https://www.hackerneue.com/item?id=16897061 - April 2018 (246 comments)

Parsing JSON is a Minefield - https://www.hackerneue.com/item?id=12796556 - Oct 2016 (292 comments)

aftbit
I'm a little sad to see that no implementations of JavaScript were on the tested parser list. I'd be interested to see where browsers' and Node.js's `JSON.parse`, as well as `eval`-based parsers, fall. As the author mentions, some JSON features are not valid JavaScript, but I wonder which of these test cases fail under `eval`.

Note, just so nobody reminds me, don't parse JSON with eval for security reasons. I'm just curious how it would work from a parser completeness point of view.

theamk
This is interesting, but seems pretty irrelevant to the real world (kind of like the "i = ++i + ++i;" C puzzle). The answer to those dangers is "don't do that, then". Use your stdlib to emit JSON, don't use string functions to modify JSON, assume any number is no better than float64, and base64 your binary data, and you will never have to worry about this "minefield".

(The only possible problem is if you are designing a security system, but even then, since all the ambiguity is about whether to reject a string, it will cause a DoS at worst.)
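The stdlib covers that advice directly; a minimal sketch of the emit-and-base64 part:

  import base64
  import json

  payload = {
      "name": "example",
      # Binary data doesn't fit JSON's Unicode strings; base64 it.
      "blob": base64.b64encode(b"\x00\xff\x10").decode("ascii"),
  }

  # Emit with the stdlib rather than by string concatenation.
  text = json.dumps(payload)

  # The consumer decodes the blob back to bytes.
  assert base64.b64decode(json.loads(text)["blob"]) == b"\x00\xff\x10"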

zzo38computer
There are problems with JSON, such as:

- The numbers are floating point, but cannot be Infinity or NaN. There is no integer type, so long integers might not survive intact (see the sketch after this list). (There are other problems with numbers too, as mentioned in the article.)

- The strings are Unicode. Non-Unicode data (including binary data) isn't handled properly, and even Unicode can have problems (some of which are mentioned in the article, but there are others too).

- Keys can only be strings, not numbers.

- The syntax conveniences aren't great, e.g. no comments, no optional trailing commas, etc.

- The format is difficult for the reasons explained in the article, too.
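As a concrete case of the long-integer point above (a Python illustration; any binary64-based consumer behaves this way):

  # 2**53 + 1 is the first integer binary64 cannot represent, so any
  # parser that reads JSON numbers as doubles silently corrupts it:
  n = 2**53 + 1                 # 9007199254740993
  float(n) == float(2**53)      # True: the +1 is lost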

One possible alternative would be a format based on a subset of PostScript (instead of JavaScript), e.g. (part of an example from Wikipedia):

  <<
    /first_name (John)
    /last_name (Smith)
    /is_alive true
    /age 27
    /phone_numbers [
      <<
        /type (home)
        /number (212 555-1234)
      >>
      <<
        /type (office)
        /number (646 555-4567)
      >>
    ]
    /spouse null
  >>
PostScript also has a binary format, comments (with a percent sign), hex string literals, etc. (And since commas are not used, the trailing-comma problem does not apply either.)

(Nevertheless, I did write a JSON parser (and also a JSON writer) in PostScript.)

It is also possible to use binary formats, CSV, etc., depending on what exactly the program needs; for many reasons, one format cannot solve everything.

BugsJustFindMe
> The numbers are floating point... long integers might not survive intact

I personally hate the usual interpretation as float and see it as a common but entirely implementation-induced failure. JSON numbers are far better interpreted as an arbitrary-precision numeric type, not float or int. The spec even says as much, and only warns that implementations mostly suck, so watch out. IMO precision myopia is why we end up with e.g. Python's refusal-by-default to (de)serialize from/to Decimal.
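To be fair, the refusal is only the default; the hooks exist (a quick sketch):

  import json
  from decimal import Decimal

  # Deserializing: opt in via the parse_float hook.
  doc = json.loads('{"price": 0.1000000000000000055}', parse_float=Decimal)
  doc["price"]                  # Decimal('0.1000000000000000055')

  # Serializing: dumps() rejects Decimal unless you supply a default,
  # and default=str emits it as a JSON string, not a bare number.
  json.dumps(doc, default=str)  # '{"price": "0.1000000000000000055"}'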

nurettin
Why not make non-strict parsers that handle Unicode, longs, and binary data, ignore comments, and allow trailing commas? If you set bend_over_backwards=true, it will do strict parsing for the poor souls who need that.

edit: I didn't mention integer keys, because object members canonically start with a letter.
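A minimal sketch of the lenient half in Python: strip // comments and trailing commas in a pre-pass that respects string literals, then hand the result to the strict parser (block comments and other extensions are left out):

  import json

  def lenient_loads(text: str):
      out = []
      i, n = 0, len(text)
      in_string = False
      while i < n:
          c = text[i]
          if in_string:
              out.append(c)
              if c == "\\" and i + 1 < n:    # keep escape pairs intact
                  out.append(text[i + 1])
                  i += 1
              elif c == '"':
                  in_string = False
          elif c == '"':
              in_string = True
              out.append(c)
          elif text[i:i + 2] == "//":        # line comment: skip to EOL
              while i < n and text[i] != "\n":
                  i += 1
              continue
          elif c == ",":
              j = i + 1                      # trailing comma: peek ahead
              while j < n and text[j] in " \t\r\n":
                  j += 1
              if j < n and text[j] in "]}":
                  i += 1
                  continue
              out.append(c)
          else:
              out.append(c)
          i += 1
      return json.loads("".join(out))

  lenient_loads('{"a": 1, // hi\n "b": [2, 3,],}')   # {'a': 1, 'b': [2, 3]}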

dragonwriter
> The numbers are floating point,

This is not true; JSON numbers are simply signed decimal numbers. They might be parsed into floating point (as in JavaScript) or any other numeric type, which makes them unreliable without additional constraints beyond what JSON specifies.

> The syntax conveniences aren't great, e.g. no comments, no optional trailing commas, etc.

I never understood these two choices in the spec, as they run totally against the “human-readable” goal…

chrisjj
> There are problems with JSON, such as:

>- The numbers are floating point, but cannot be Infinity or NaN.

The numbers are in fact reals. Infinity and NaN are not reals.

sureglymop
Wrote a JSON parser recently and didn't think this hard about it (because the spec is so simple). Time to revisit.
RedShift1
Parsing any format is a minefield though...
jwells89
JSON has its issues, but modern languages include facilities to work with it (fewer dependencies to wrangle is always great), and typesafe (de)serializers can be synthesized automatically with a little design thoughtfulness instead of needing to be written by hand (see Swift’s Codable and Kotlin/Java’s Moshi, for example). In my opinion that makes it compelling enough to overlook its warts. It doesn’t fit everywhere, of course, but it’s more than good enough for a vast range of applications.
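Python has no Codable, but here is a rough sketch of the same synthesis idea using dataclasses (assuming simple, non-generic annotations; all names here are mine):

  import json
  from dataclasses import dataclass, fields

  @dataclass
  class User:
      name: str
      age: int

  def decode(cls, text: str):
      # Build a dataclass instance from JSON, checking each field
      # against its declared annotation.
      raw = json.loads(text)
      kwargs = {}
      for f in fields(cls):
          value = raw[f.name]
          if not isinstance(value, f.type):
              raise TypeError(f"{f.name}: expected {f.type.__name__}")
          kwargs[f.name] = value
      return cls(**kwargs)

  decode(User, '{"name": "Ada", "age": 36}')   # User(name='Ada', age=36)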
mariusor
Funny how Baader-Meinhof works, I just finished writing a JSON toy parser earlier today. I guess I'll add the mentioned corner cases to the testsuite, and watch them fail. :D
And so I just now learned that the Baader-Meinhof Gang of the 1970s gave its name to the phenomenon of frequency illusion, where once you hear about a thing you notice many more references to it.
stevejb
I definitely agree with a lot of the comments here, especially the ones in the vein of "don't do dangerous things with JSON". If you have control of both the sender and the receiver, it makes sense to have fields that add a bit of extra type information, e.g. "this is an integer" or "this is a float with this much precision".
acheong08
At that point just use protobuf
Writing a JSON parser is a good way to teach yourself better programming practices. I attribute my understanding of pointer arithmetic and I/O streams to my own efforts at parsing and generating JSON.
thecleaner
Btw, if we use parser generators like ANTLR for this purpose, is it still a minefield? Can someone point me to some vulnerabilities I can study?
Thaxll
XML was indeed better.
IshKebab
It absolutely wasn't, primarily because the XML data model is so mismatched with the object structures you find in programming languages.

It does at least support comments though. Biggest flaw in JSON by far.

wtetzner
> primarily because the XML data model is so mismatched with the object structures you find in programming languages

I dunno, it matches up reasonably well with languages that have nestable custom types.

XML labels nodes, and JSON labels edges. They both have pluses and minuses.

IshKebab
XML nodes have attributes and contents. No programming language I know of works like that.
wtetzner
It doesn't seem like an especially important distinction. I've never understood why people always make such a big deal about it.
IshKebab
It's a big pain when decoding XML. That's not the only impedance mismatch. The fact that XML is just a soup of objects is not how programming language objects work either.

https://docs.rs/serde-xml-rs/0.6.0/serde_xml_rs/#caveats

Look at how much more complex this is than the equivalent JSON code, which requires none of these annotations:

https://docs.rs/strong-xml/latest/strong_xml/

Yet still very inefficient.
Tao3300
Apples and oranges.
To me, XML and JSON are actually technically quite similar, and both were touted in their heyday for similar reasons over more apt formats (auto-documented! simple to implement an ad-hoc serializer!).

Would you mind explaining why you think it is an apples-and-oranges comparison?

Tao3300
One is a markup language. The other is an object notation.
Clamchop
I'd say that's part of the calculus for comparison. Comparing apples and oranges isn't nonsensical if what you need is a fruit. XML and JSON are alternatives for many use cases, so you can evaluate whether markup and schemas are valuable enough to be worth the added hassle.
chrisjj
Nice.

Feedback:

> I wrote yet another JSON parser (section 6)

Link defunct.

douglee650
```

One day a student came to Moon and said: “I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons.”

Moon patiently told the student the following story:

“One day a student came to Moon and said: ‘I understand how to make a better garbage collector...

```

