> The fact that they went out of their way to break python 2 unicode when running on python 3 was just totally nuts. Especially after making such a big deal about unicode!
Imo it's infinitely worse than that.
The big deal about Unicode is its nature, as defined in the "Summary Narrative" from 1991[0]. To wit:
> The Unicode character encoding derives its name from three main goals:
* universal (addressing the needs of world languages)
* uniform (fixed-width codes for efficient access), and
* unique (bit sequence has only one interpretation into character codes)
The Unicode folk realized that it would take decades to shift developers worldwide to doing that properly, so they adopted a three stage plan for software (eg the string types of programming languages) to get from where things were, to where they needed to be:
* Stage #1: Character = byte
* Stage #2: Character = code point
* Stage #3: Character = what a user thinks of as a character[1]
Python 1 was a Stage #1 language -- Character = byte -- like most others of its time.
In Python 2 there were tweaks to try to move toward Stage #2 -- Character = code point -- again, like most other PLs of its time.
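To make the first two stages concrete, here's a minimal sketch in modern Python syntax (the bytes/str split mirrors Python 2's str/unicode split; assuming UTF-8 throughout):

    s = "é"                        # a single code point, U+00E9
    print(len(s.encode("utf-8")))  # 2 -- Stage #1 view: Character = byte
    print(len(s))                  # 1 -- Stage #2 view: Character = code point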
In Python 3, they dictated a full switch to Stage #2 -- Character = code point. That was an unnecessarily painful break relative to Python 2. But -- and this is what really matters -- they entirely ignored Stage #3, which is the whole point of Unicode in the final analysis.
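The Stage #2/Stage #3 gap is easy to demonstrate. A user-perceived character can span several code points, and Python 3's str only ever counts the latter; the stdlib has no grapheme iterator, so you need something like the third-party regex module, whose \X matches one extended grapheme cluster (a sketch, not an endorsement of any particular library):

    s = "e\u0301"   # "e" + COMBINING ACUTE ACCENT: one grapheme ("é"), two code points
    print(len(s))   # 2 -- Stage #2 counting: Character = code point

    import regex    # third-party; pip install regex
    print(len(regex.findall(r"\X", s)))  # 1 -- Stage #3 counting: Character = grapheme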
[0] https://www.unicode.org/history/summary.html
[1] https://unicode.org/glossary/#grapheme