The problem is that formats like JSON are designed to be human readable and writable. Length prefixing is a non-starter here.
Protobuf and similar formats are binary, so they don't have this limitation.
> The problem is that formats like JSON are designed to be human readable and writable. Length prefixing is a non-starter here.
Canonical S-expressions are both human-readable and length-prefixed. They do this by having an advanced representation, which is human-friendly:
(data (looks "like this" |YWluJ3QgaXQgY29vbD8=|))
And a canonical representation, which is length-prefixed:
(4:data(5:looks9:like this14:ain't it cool?))
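To make the canonical form concrete, here is a minimal sketch in Python of an encoder and decoder for it. The function names `encode_csexp` and `decode_csexp` are made up for illustration, and the sketch covers only atoms and lists, not the advanced representation or display hints.

```python
def encode_csexp(node):
    """Encode nested lists of str/bytes as a canonical S-expression (bytes)."""
    if isinstance(node, str):
        node = node.encode("utf-8")
    if isinstance(node, bytes):
        # Atom: decimal byte length, a colon, then the raw bytes.
        return str(len(node)).encode("ascii") + b":" + node
    return b"(" + b"".join(encode_csexp(child) for child in node) + b")"


def decode_csexp(data, pos=0):
    """Decode one expression starting at pos; return (value, next_pos)."""
    if data[pos:pos + 1] == b"(":
        pos += 1
        items = []
        while data[pos:pos + 1] != b")":
            if pos >= len(data):
                raise ValueError("unterminated list")
            item, pos = decode_csexp(data, pos)
            items.append(item)
        return items, pos + 1
    colon = data.index(b":", pos)
    digits = data[pos:colon]
    if not digits.isdigit():
        raise ValueError("expected a decimal length prefix")
    start = colon + 1
    end = start + int(digits)
    if end > len(data):
        raise ValueError("atom runs past end of input")
    # No escaping or scanning: the length tells us exactly where the atom ends.
    return data[start:end], end


encoded = encode_csexp(["data", ["looks", "like this", "ain't it cool?"]])
print(encoded)  # b"(4:data(5:looks9:like this14:ain't it cool?))"
```

Note that the decoder never inspects the contents of an atom, which is why the space and the apostrophe need no quoting.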
The PHP serialisation format has many issues, especially since it allows all sorts of PHP data structures to be encoded. This allows defining references and serializing objects using custom routines into arbitrary binary blobs. Also, PHP's unserialization can trigger the autoloader as it tries to resolve classes that aren't loaded yet, which in turn can run unsafe routines in those classes.
It is certainly not a format for exchanging data between systems, especially with untrusted sources.
You shouldn't use PHP's unserialize implementation with untrusted sources, but my point was that its format makes it relatively simple to parse compared with JSON or XML, where you have to do a lot of work to parse strings. If you're writing your own parser (including a parser for another language), you could decide to only parse basic types (bool, int, float, array); if you're designing your own format, you could take the lesson that length-prefixed strings are much easier for computers to handle than delimited strings.
PHP serialization is better here: everything is type:value or type:length:value. Strings do have quotes around them, but because their byte length is known, internal quotes need not be escaped. You can still have issues with generating and parsing the human-readable numbers properly (floating point is always fun, and integers may have some bit-size limit I don't recall), but you don't need to worry about quoting Unicode values properly.
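As a rough illustration of how little work that takes, here is a sketch of a parser in Python that handles only null, bool, int, float, string, and array, and rejects everything else (objects, references). The names `parse_php` and `_expect` are hypothetical, and this is not a drop-in replacement for PHP's own unserialize.

```python
def _expect(data, pos, token):
    """Raise unless data contains token at pos."""
    if data[pos:pos + len(token)] != token:
        raise ValueError(f"expected {token!r} at offset {pos}")


def parse_php(data, pos=0):
    """Parse one value from PHP-serialized bytes; return (value, next_pos)."""
    tag = data[pos:pos + 1]
    if tag == b"N":                       # N;
        _expect(data, pos + 1, b";")
        return None, pos + 2
    _expect(data, pos + 1, b":")
    if tag in (b"b", b"i", b"d"):         # b:1;  i:42;  d:0.5;
        end = data.index(b";", pos + 2)
        text = data[pos + 2:end].decode("ascii")
        if tag == b"b":
            return text == "1", end + 1
        if tag == b"i":
            return int(text), end + 1
        return float(text), end + 1
    if tag == b"s":                       # s:<byte length>:"<bytes>";
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        _expect(data, colon + 1, b'"')
        start = colon + 2
        end = start + length
        # Jump straight to the closing quote; the contents are never scanned.
        _expect(data, end, b'";')
        return data[start:end], end + 2
    if tag == b"a":                       # a:<count>:{<key><value>...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        _expect(data, colon + 1, b"{")
        pos = colon + 2
        result = {}
        for _ in range(count):
            key, pos = parse_php(data, pos)
            value, pos = parse_php(data, pos)
            result[key] = value
        _expect(data, pos, b"}")
        return result, pos + 1
    raise ValueError(f"unsupported type tag {tag!r}")


# The inner quotes are not escaped; the length prefix makes them unambiguous.
print(parse_php(b's:5:"a "b"";'))  # (b'a "b"', 12)
```

Strings come back as raw bytes here, since the prefix counts bytes rather than characters; decoding them is left to the caller.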
Protocol buffers have clear length indications, so that's easier, but it's not a 'self-documenting' format: you need the description file to interpret an encoded value. The end result is usually many fewer bits, though.
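For comparison, here is a sketch of reading a single field from the protobuf wire format in Python, handling only varint (wire type 0) and length-delimited (wire type 2) fields. `read_varint` and `read_field` are names invented for this example; a real project would use the generated protobuf classes instead.

```python
def read_varint(data, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if not byte & 0x80:                # high bit clear: last byte
            return result, pos
        shift += 7


def read_field(data, pos=0):
    """Read one field; return (field_number, wire_type, value, next_pos)."""
    key, pos = read_varint(data, pos)
    field_number, wire_type = key >> 3, key & 0x07
    if wire_type == 0:                     # varint
        value, pos = read_varint(data, pos)
    elif wire_type == 2:                   # length-delimited: varint length, then bytes
        length, pos = read_varint(data, pos)
        if pos + length > len(data):
            raise ValueError("field runs past end of input")
        value = data[pos:pos + length]
        pos += length
    else:
        raise ValueError(f"wire type {wire_type} not handled in this sketch")
    return field_number, wire_type, value, pos


# Field 2, length-delimited, 7 bytes: nine bytes on the wire in total.
print(read_field(bytes.fromhex("120774657374696e67")))  # (2, 2, b'testing', 9)
```

The result shows both points: the length prefix makes reading trivial, but all you get back is "field 2, some bytes". Whether those bytes are a string, raw bytes, or a nested message, and what the field is called, comes only from the description file.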