Comment by rendaw - Hacker Neue

rendaw Jun 2, 2024 parent

The primitive types JSON specifies are redundant and generally only lead to issues. Almost all JSON consumers are deserializing to a spec that already contains type information, frequently richer and with a greater variety of types (URL, telephone number, UUID, not just "string"). Even without a spec, the code will be written to need a specific type (e.g. you're not going to write code to accept an integer when you want a person's name).

It would be much simpler if all primitives were strings, and it'd probably save a few people from accidentally doing the wrong thing while dealing with prices.
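To make the price point concrete, here is a minimal Python sketch. The `prices` field and both payloads are made up for illustration. It contrasts JSON numbers, which a default parser turns into binary floats, with strings that the consumer has to parse into a type of its own choosing.

```python
import json
from decimal import Decimal

# Hypothetical payloads: the same data, once as JSON numbers and once as strings.
as_numbers = json.loads('{"prices": [0.1, 0.2]}')
as_strings = json.loads('{"prices": ["0.1", "0.2"]}')

# JSON numbers become binary floats by default, so the sum is slightly off.
float_total = sum(as_numbers["prices"])

# Strings force the consumer to pick a type; Decimal keeps the amounts exact.
decimal_total = sum(Decimal(p) for p in as_strings["prices"])

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

Python's `json.loads` also accepts `parse_float=Decimal`, so the same safety is available with number literals, but only if the consumer remembers to ask for it.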

aftbit Jun 2, 2024

Perhaps. I've often wished that JSON supported some sort of custom types or type annotations, or failing that, at least datetimes. Some other nice extensions would be support for comments and optional trailing commas.

There is something very nice and expressive about the existing JSON types. Just 6 types (null, boolean, string, number, array, and dictionary) are enough to cover a ton of use cases, and as you suggest, one can always fall back to "stringly typed" alternatives by implementing one's own serialization and deserialization for extra types.

ooterness Jun 2, 2024

You may be interested in CBOR (IETF RFC 8949).

CBOR features are almost one-to-one with JSON, except that the encoding is more size-efficient, it supports a few additional types (e.g., integers and floats are separate), and it allows semantic tags.

https://en.wikipedia.org/wiki/CBOR
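As a rough illustration of the separate integer type and the semantic tags, here is a hand-rolled Python sketch of CBOR's header encoding as described in RFC 8949. The function names are made up for this example and it is not a full encoder; a real project would use an existing CBOR library.

```python
import struct

def encode_head(major_type: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (RFC 8949, section 3)."""
    if value < 0:
        raise ValueError("argument must be non-negative")
    prefix = major_type << 5
    if value < 24:
        return bytes([prefix | value])          # argument fits in the initial byte
    if value < 1 << 8:
        return bytes([prefix | 24]) + struct.pack(">B", value)
    if value < 1 << 16:
        return bytes([prefix | 25]) + struct.pack(">H", value)
    if value < 1 << 32:
        return bytes([prefix | 26]) + struct.pack(">I", value)
    if value < 1 << 64:
        return bytes([prefix | 27]) + struct.pack(">Q", value)
    raise ValueError("argument does not fit in 64 bits")

def encode_uint(n: int) -> bytes:
    return encode_head(0, n)                    # major type 0: unsigned integer

def encode_text(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_head(3, len(data)) + data     # major type 3: UTF-8 text string

def encode_tagged(tag: int, payload: bytes) -> bytes:
    return encode_head(6, tag) + payload        # major type 6: semantic tag

print(encode_uint(1000).hex())                              # 1903e8
print(encode_text("a").hex())                               # 6161
# Tag 1 marks the enclosed number as an epoch-based date/time.
print(encode_tagged(1, encode_uint(1363896240)).hex())      # c11a514b67b0
```

The three printed values match the examples in Appendix A of RFC 8949.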

zzo38computer Jun 2, 2024

There are some benefits of CBOR (a separate integer type is good, a byte string type is good, typed numeric arrays are good too, etc.), but also some problems. For example, I would have preferred Unicode to be a tag rather than a type (other tags could then be used for other character sets). Base64-encoded strings also seem unnecessary (since it is a binary format anyway, you should just use the binary data directly). I think it would be better for a MIME message to be treated as a byte string instead of Unicode (fortunately the specification allows that, but it seems to have been "added on" afterward due to a lack of consideration). It might also be better to disallow arrays and maps as key types.

However, some of the things I mentioned above do have benefits for interoperability with JSON, although they aren't good for general-purpose use; I think that it would generally be better to make a good format rather than trying to work only with the bad ideas of other specifications. (Fortunately, I think what I described above could be implemented using a subset of CBOR.)

However, using these formats (whether CBOR or JSON) is often more complicated than should be needed for a specific use anyways.

murmansk Jun 2, 2024

While it might be great in theory, CBOR has its own separate set of dragons waiting for you.

Expectation: tags in CBOR allow you to pass semantics. Reality: a multitude of tags and the absence of strict rules for them make it a pain in the ass.

kibwen Jun 2, 2024

Let's make a distinction here between serialization formats and configuration formats. Because JSON is often used for both, these two use cases often get conflated.

For configuration formats, I 100% agree with you. I do not want any data type except a string and a hashmap (maybe an array if you're being luxurious). Not an int, not a float, not a boolean, not a datetime (looking at you, TOML). For configuration formats I am always immediately feeding those files into a language with a richer type system that will actually parse them; my program and its embedded types are the schema. (Users of dynamically-typed languages may reasonably disagree.)
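A minimal Python sketch of "my program and its embedded types are the schema": the config arrives as a flat map of strings and the program parses each value into the type it actually wants. `ServerConfig`, `parse_config`, and the keys are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class ServerConfig:
    host: str
    port: int
    log_dir: Path

def parse_config(raw: dict[str, str]) -> ServerConfig:
    """Turn a string-only config map into application types, validating as we go."""
    port = int(raw["port"])                      # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return ServerConfig(host=raw["host"], port=port, log_dir=Path(raw["log_dir"]))

# Every value is a string on the way in; the dataclass is the schema.
config = parse_config({"host": "localhost", "port": "8080", "log_dir": "/var/log/app"})
print(config.port + 1)  # 8081
```

A format-level integer type would not have caught the range check here anyway, which is the kind of rule only the application knows.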

However, for the serialization use case, I'm not so sure. There's an argument that having a schema against which to do lightweight validation at several points in the pipeline isn't the worst idea, and built-in primitives get you halfway to a half-decent schema. I'm ambivalent at worst.

troupo Jun 2, 2024

> my program and its embedded types are the schema.

They are not. Configuration is a very tiny subset of a more general problem that you also mention: serialization.

Your config file will be de-serialized by your program and parsed into some specific types. Including numbers (tons of edge cases), dates (tons of edge cases), strings (tons of edge cases) etc.

It becomes worse when your program is used by more people than just you: which field is a date? In which format? Do you handle floats? What precision? What's the decimal separator? Do you do string normalization? What are valid and invalid characters, if any?
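Two of those questions are easy to demonstrate in Python. The sample values below are made up; the point is only that the same string is valid under more than one reading, or valid under none, depending on conventions the file itself does not record.

```python
from datetime import datetime

raw_date = "03/04/2024"

# The same string is a valid date under both conventions, with different results.
day_first = datetime.strptime(raw_date, "%d/%m/%Y").date()
month_first = datetime.strptime(raw_date, "%m/%d/%Y").date()
print(day_first)    # 2024-04-03
print(month_first)  # 2024-03-04

def parses_as_float(text: str) -> bool:
    """Report whether Python's float() accepts the text as written."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# A comma as decimal separator is rejected outright by float().
print(parses_as_float("1.5"))  # True
print(parses_as_float("1,5"))  # False
```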

You can't pretend that your config is "just strings". It is not.

mike_hock Jun 2, 2024

I kind of took away the opposite from the parent post. Of course, your config isn't just strings, but it also isn't just a limited set of primitive types that the inventor of some one-size-fits-all configuration language envisioned.

You can't build a generic schema validator that will accept exactly the valid configs for some program and nothing else anyway, so forget the half-assed type checking attempts and just provide the hierarchical structure. It's up to the application to define the valid grammar and semantics of each config option and parse it into an application-specific type.

troupo Jun 3, 2024

That's why every time I run into a program-specific config I curse the developer because there's no way of knowing what exactly a particular program (or a framework) needs :)

wruza Jun 2, 2024

But most configs are just strings and it’s okay. How does it get so bad just in this thread?

Human input is full of tradeoffs, that’s why it’s bash and not typescript in your shell path column. And you’ll meet a great resistance from users if you make your config fully typed and require to refer to schema dtd ns or whatever bs xml had.

troupo Jun 3, 2024

> that’s why it’s bash and not typescript in your shell path column

Bash is there purely for historical reasons. And it sucks.

> And you’ll meet a great resistance from users if you make your config fully typed and require to refer to schema dtd ns or whatever bs xml had.

That schema can and will help editors to validate and autocomplete things on the fly, and can also serve as a reference for what actual data the config accepts.

hgyjnbdet Jun 2, 2024

I would say all configs should be treated as castable strings. That's why for config files I much prefer the INI format.
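Python's standard `configparser` works roughly this way: every value is stored as a string and the caller casts on read. A small sketch, with a made-up `[server]` section:

```python
import configparser

# Hypothetical INI content; every value below is stored as a plain string.
ini_text = """
[server]
port = 8080
debug = yes
timeout = 2.5
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)
section = parser["server"]

print(repr(section["port"]))        # '8080'  (raw access is always a string)

# The caller decides the type at the point of use.
port = section.getint("port")
debug = section.getboolean("debug")  # accepts yes/no, true/false, on/off, 1/0
timeout = section.getfloat("timeout")
print(port, debug, timeout)          # 8080 True 2.5
```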

nevermore24 Jun 2, 2024

The strings are strings. I don't care how people handle their dates, that's between them and their god.

crazygringo Jun 2, 2024

> Almost all JSON consumers are either deserializing to a spec that already contains type information

But different languages interpret different strings in different ways by default.

This leads to major bugs.

One of the great strengths of JSON is that parsing a number is well-defined.

The way you're suggesting would lead to people emitting JSON with leading zeros sometimes, and then some languages end up interpreting certain numbers as octal.

No thank you.
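A small Python sketch of the leading-zero problem. It uses only the standard library; the string `"010"` is just an example value. The JSON grammar rejects the literal, while string-to-integer conversion depends on which base the caller happens to pick.

```python
import json

text = "010"

def is_rejected(func, *args) -> bool:
    """Return True if calling func(*args) raises ValueError."""
    try:
        func(*args)
        return False
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return True

# As a plain string, the meaning depends on the conversion used.
print(int(text))                   # 10 (decimal, leading zero ignored)
print(int(text, 8))                # 8  (explicit octal)
print(is_rejected(int, text, 0))   # True (base 0 refuses the ambiguous form)

# As a JSON number, the literal is simply invalid.
print(is_rejected(json.loads, text))  # True
```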

anonymoushn Jun 2, 2024

JSON numbers are just certain strings, but some tools that deal with JSON, such as jq, feel a need to mangle the numbers anyway.

crazygringo Jun 2, 2024

I don't know what you mean.

JSON numbers are far more restrictive than strings and carry precisely defined meaning in a way that arbitrary strings don't. They're only "just certain strings" in the same way anything can be serialized to a string, which doesn't really mean anything.

What does jq do to them?

anonymoushn Jun 3, 2024

It replaces them with different numbers, even if you don't try to do math on them :)

  echo 1.4e99999999999999 | jq
  1.7976931348623157e+308

While I agree that the meaning of JSON numbers exists, I'm not sure which JSON standard you're referring to that contains this meaning. json.org certainly does not contain it, and links to ECMA-404, which just says "JSON is agnostic about the semantics of numbers."
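The same kind of silent replacement can be reproduced with Python's standard `json` module, shown here as a sketch with a smaller literal (`1e400`, chosen for this example rather than taken from the jq session above). The grammar accepts the number, and what it turns into is decided by the parser.

```python
import json
import math
from decimal import Decimal

text = "1e400"  # valid per the JSON grammar, too large for a binary64 double

# Default parsing maps the literal onto a float, which overflows to infinity.
as_float = json.loads(text)
print(math.isinf(as_float))      # True

# Re-serializing produces a token that is not valid JSON at all.
print(json.dumps(as_float))      # Infinity

# Asking for Decimal keeps the value that was actually written.
as_decimal = json.loads(text, parse_float=Decimal)
print(as_decimal == Decimal("1e400"))  # True
```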

wwader Jun 3, 2024

jq tries to preserve number precision if you don't do operations on them, but as you noted this only holds within reason. If you do operations, the numbers involved will first be converted to binary64 (aka double), same as node and most other languages. This is what is recommended by RFC 7159 for interoperability.
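What the binary64 conversion costs can be sketched in Python, whose `json` module keeps integers exact by default. Passing `parse_int=float` below is only a stand-in for a parser that stores every number as a double; it is not a claim about how jq is implemented.

```python
import json

text = "9007199254740993"  # 2**53 + 1, the first integer a double cannot represent

# Python's default: arbitrary-precision int, nothing is lost.
exact = json.loads(text)
print(exact)               # 9007199254740993

# Emulating a binary64-only parser: the value rounds to the nearest double.
as_double = json.loads(text, parse_int=float)
print(int(as_double))      # 9007199254740992

print(exact == int(as_double))  # False
```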

VMG Jun 2, 2024

Disagree. The typical ad hoc funcs for parsing string to bool make me despair (uppercase, lowercase, true, yes, y, 1, ...)

sgarland Jun 2, 2024

Python’s distutils had a strtobool() function that was very handy for this, but the module has been removed. It’s trivial to re-implement, but still slightly annoying to have to do.
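A re-implementation might look like the sketch below. The accepted spellings follow the ones `distutils.util.strtobool` recognised, but this version is an assumption-laden stand-in rather than a copy: it returns a real `bool` instead of 1/0 and also strips surrounding whitespace.

```python
_TRUE_VALUES = {"y", "yes", "t", "true", "on", "1"}
_FALSE_VALUES = {"n", "no", "f", "false", "off", "0"}

def strtobool(value: str) -> bool:
    """Parse a human-written truth value, raising ValueError on anything else."""
    normalized = value.strip().lower()
    if normalized in _TRUE_VALUES:
        return True
    if normalized in _FALSE_VALUES:
        return False
    raise ValueError(f"invalid truth value {value!r}")

print(strtobool("Yes"))   # True
print(strtobool("off"))   # False
```

Raising on unknown input, rather than defaulting to `False`, keeps a typo like `ture` from silently disabling a setting.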

IshKebab Jun 2, 2024

It's extremely common in dynamically typed languages to deserialise JSON without a spec. What you're asking for is basically XML and it's definitely nicer to get at least basic types (string, bool, int, etc.) "for free".

kemitchell Jun 2, 2024

I've long used a toylike "Lists and Maps of Strings" format for personal recordkeeping and automation. https://www.npmjs.com/package/lamos

I've never gone back to formalize the grammar or otherwise mature it. But it's served me well as-is, and it's been easy to convert "up" to JSON or YAML or XML or what-have-you, once the case for an interface beyond plain text proves worthwhile.

fuzztester Jun 2, 2024

> It would be much simpler if all primitives were strings

TCLON?
