Comment by toast0 - Hacker Neue

toast0 Apr 22, 2018 parent

My experience with JSON and similar formats is that most of the complexity arises from using delimited strings instead of length-prefixed strings, and the exciting escaping that results. If the strings are character strings instead of byte strings, you get to add an extra layer of character encoding excitement.

PHP serialization is better here: everything is type:value or type:length:value. Strings do have quotes around them, but because their byte length is known, internal quotes need not be escaped. You can still have issues with generating and parsing the human-readable numbers properly (floating point is always fun, and integers may have some bit size limit I don't recall), but you don't need to worry about quoting Unicode values properly.
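To make the quoting point concrete, here is a minimal Python sketch of reading one length-prefixed string in the `s:<byte length>:"<bytes>";` shape that PHP's serialize() emits. The function name `parse_php_string` is made up for illustration, and this is not PHP's actual implementation, only a reading of the format under that assumption.

```python
def parse_php_string(data: bytes, pos: int = 0):
    """Parse one token shaped like s:<byte length>:"<bytes>"; starting at pos.

    Returns (raw_bytes, next_pos). Hypothetical helper, not a PHP API.
    """
    if data[pos:pos + 2] != b"s:":
        raise ValueError("expected a string token")
    colon = data.index(b":", pos + 2)      # end of the decimal length field
    length = int(data[pos + 2:colon])      # length counts bytes, not characters
    start = colon + 2                      # skip the colon and the opening quote
    end = start + length
    if data[colon + 1:start] != b'"' or data[end:end + 2] != b'";':
        raise ValueError("malformed string framing")
    # The payload is sliced out by length, so quotes inside it are never scanned.
    return data[start:end], end + 2


value, next_pos = parse_php_string(b's:11:"say "hi" ok";')
print(value)  # b'say "hi" ok'
```

The parser never looks inside the payload, which is why the embedded quotes need no escaping and why the character encoding of the bytes is somebody else's problem.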

Protocol buffers have clear length indications, so that's easier, but it's not a 'self-documenting' format: you need the description file to parse an encoded value. The end result is usually many fewer bits, though.

ChrisSD Apr 22, 2018

The problem is that formats like JSON are designed to be human readable and writable. Length prefixing is a non-starter here.

Protobuf and similar are binary formats so don't have this limitation.

zeveb Apr 24, 2018

> The problem is that formats like JSON are designed to be human readable and writable. Length prefixing is a non starter here.

Canonical S-expressions are both human-readable & length-prefixed. They do this by having an advanced representation which is human-friendly:

    (data (looks "like this" |YWluJ3QgaXQgY29vbD8=|))

And a canonical representation which is length-prefixed:

    (4:data(5:looks9:like this14:ain't it cool?))
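A rough Python sketch of producing the canonical form from a nested list, assuming atoms are either str (encoded as UTF-8) or bytes. The function name `to_canonical` is mine, not from any S-expression library, and the sketch does not read or write the advanced representation.

```python
import base64


def to_canonical(expr) -> bytes:
    """Encode nested lists of atoms as a length-prefixed canonical S-expression."""
    if isinstance(expr, (list, tuple)):
        # A list is its encoded members wrapped in parentheses, with no separators.
        return b"(" + b"".join(to_canonical(item) for item in expr) + b")"
    atom = expr.encode("utf-8") if isinstance(expr, str) else bytes(expr)
    # An atom is <decimal byte length>:<raw bytes>, so nothing needs escaping.
    return str(len(atom)).encode("ascii") + b":" + atom


# The |...| atom in the advanced form is base64, decoded here by hand.
blob = base64.b64decode("YWluJ3QgaXQgY29vbD8=")
print(to_canonical(["data", ["looks", "like this", blob]]))
# b"(4:data(5:looks9:like this14:ain't it cool?))"
```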

johannes1234321 Apr 22, 2018

The PHP serialisation format has many issues, especially since it allows all sorts of PHP data structures to be encoded. This allows defining references and serializing objects using custom routines into arbitrary binary blobs. Also PHP's unserialization can be used to trigger the autoloader as it tries to resolve unloaded classes, which can trigger unsafe routines in those.

It is certainly not a format for data exchange between systems, especially with untrusted sources.

toast0 OP Apr 23, 2018

You shouldn't use PHP's unserialize implementation with untrusted sources, but my point was that its format makes it relatively simple to parse compared with JSON or XML, where you have to do a lot of work to parse strings. If you're writing your own parser (including a parser for another language), you could decide to only parse basic types (bool, int, float, array); if you're designing your own format, you could take the lesson that length-prefixed strings are much easier for computers to use than delimited strings.
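As an illustration of the "only parse basic types" idea, a sketch of a restricted reader in Python. It accepts the tags N, b, i, d, s and a, and rejects everything else (objects, references, custom-serialized classes). The names `parse_value` and `_expect` are hypothetical, arrays are returned as a dict, and strings stay as raw bytes; treat it as a sketch under those assumptions, not a hardened parser.

```python
def _expect(data: bytes, pos: int, token: bytes) -> None:
    if data[pos:pos + len(token)] != token:
        raise ValueError(f"expected {token!r} at offset {pos}")


def parse_value(data: bytes, pos: int = 0):
    """Return (value, next_pos) for null, bool, int, float, string or array only."""
    tag = data[pos:pos + 1]
    if tag == b"N":                        # N;
        _expect(data, pos + 1, b";")
        return None, pos + 2
    if tag in (b"b", b"i", b"d"):          # b:1;  i:42;  d:1.5;
        _expect(data, pos + 1, b":")
        end = data.index(b";", pos + 2)
        text = data[pos + 2:end].decode("ascii")
        if tag == b"b":
            if text not in ("0", "1"):
                raise ValueError("bad bool")
            return text == "1", end + 1
        return (int(text) if tag == b"i" else float(text)), end + 1
    if tag == b"s":                        # s:<byte length>:"<bytes>";
        _expect(data, pos + 1, b":")
        colon = data.index(b":", pos + 2)
        start = colon + 2
        end = start + int(data[pos + 2:colon])
        _expect(data, colon + 1, b'"')
        _expect(data, end, b'";')
        return data[start:end], end + 2
    if tag == b"a":                        # a:<count>:{<key><value>...}
        _expect(data, pos + 1, b":")
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        _expect(data, colon + 1, b"{")
        pos, result = colon + 2, {}
        for _ in range(count):
            key, pos = parse_value(data, pos)
            if isinstance(key, bool) or not isinstance(key, (int, bytes)):
                raise ValueError("array keys must be int or string")
            result[key], pos = parse_value(data, pos)
        _expect(data, pos, b"}")
        return result, pos + 1
    # Anything else (O:, C:, r:, R:, ...) is refused rather than interpreted.
    raise ValueError(f"unsupported type tag {tag!r}")


print(parse_value(b'a:2:{i:0;s:1:"a";s:1:"k";d:1.5;}')[0])
# {0: b'a', b'k': 1.5}
```

Because unknown tags fall through to the final raise, an object payload such as `O:8:"stdClass":0:{}` is rejected before any class name is looked at.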
