> I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well?
It's not that UTF-8 is because of JavaScript, it's that indexing by bytes rather than by scalars is because of JavaScript. To use UTF-8 in JavaScript, you use TextEncoder/TextDecoder; TextEncoder returns the string as a Uint8Array, which is indexed by bytes.
So if you have a string "Cześć, #Bluesky!" and you want to mark the "#Bluesky" part with a hashtag link facet, the index range is 9...17 (bytes), and not 7...15 (scalars).
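As a concrete check on those numbers, here's a small sketch (plain Node.js, nothing Bluesky-specific; `byteRange` is just a name I made up) that finds a substring and computes its UTF-8 byte offsets by encoding the prefix:

```javascript
const encoder = new TextEncoder();

// Compute the UTF-8 byte offsets of `target` within `text`.
function byteRange(text, target) {
  const start = text.indexOf(target); // index in UTF-16 code units
  if (start === -1) return null;
  // Encode the prefix and the target separately to get byte lengths.
  const byteStart = encoder.encode(text.slice(0, start)).length;
  const byteEnd = byteStart + encoder.encode(target).length;
  return { byteStart, byteEnd };
}

byteRange("Cześć, #Bluesky!", "#Bluesky"); // { byteStart: 9, byteEnd: 17 }
```

The ś and ć each take two bytes in UTF-8, which is where the 9 comes from instead of 7.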
When the encoding is UTF-8 (which it is here), the code unit is the byte.
They called the fields byteStart and byteEnd, but more technically precise (no more or less accurate, but more precise) labels would be utf8CodeUnitStart and utf8CodeUnitEnd.
You may not have seen this interesting article before: https://hsivonen.fi/string-length/. I agree with its assessment that scalars are really pretty useless as a measure, and Python and Ruby are foolish to have chased it at such expense.
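For a quick feel of why "length" is so slippery, here is that article's headline example (the facepalm emoji with a skin tone, built from escapes so it survives copy-paste), measured three ways in JavaScript:

```javascript
// U+1F926 FACE PALM + U+1F3FC skin tone + U+200D ZWJ + U+2642 MALE SIGN + U+FE0F VS16
const s = "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F";

s.length;                           // 7  — UTF-16 code units (JS's native count)
[...s].length;                      // 5  — Unicode scalar values (string iteration walks by code point)
new TextEncoder().encode(s).length; // 17 — UTF-8 bytes
```

One user-perceived character, three different answers; none of them is "1".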
But seriously, I can’t think of any other popular languages that count by scalars or code points—it’s definitely not most languages, it’s a minority, all a very specific sort of language. “Most” encompasses well-formed UTF-8 (e.g. Rust), recommended UTF-8 but it doesn’t actually care (e.g. Go), potentially ill-formed UTF-16 (e.g. JavaScript, Java, .NET), and total-mess (e.g. C, C++).
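"Potentially ill-formed UTF-16" is easy to demonstrate in JavaScript: a string may contain a lone surrogate, which has no UTF-8 encoding, so TextEncoder (per the WHATWG Encoding spec) substitutes U+FFFD when you encode it:

```javascript
const lone = "\uD83D";  // a high surrogate with no pair — still a valid JS string
lone.length;            // 1

const bytes = new TextEncoder().encode(lone);
Array.from(bytes);      // [239, 191, 189] — the UTF-8 bytes of U+FFFD REPLACEMENT CHARACTER
```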
Very few do. Of moderately popular languages, Python is the only one I can think of. Well, Python strings are actually sequences of code points rather than scalars, which is a huge mistake, but provided your strings came from valid Unicode that doesn’t matter.
Languages like Rust and Swift make it fairly easy to access your string by UTF-8 or by scalar.
Languages like Java and JavaScript index by UTF-16 code unit and make anything else at least moderately painful.
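Concretely, the UTF-16 pain in JavaScript: naive indexing lands you inside surrogate pairs, and you have to reach for codePointAt or iteration to get at scalars:

```javascript
const s = "💖!";   // U+1F496 is outside the BMP, so it takes two UTF-16 code units
s.length;          // 3, not 2
s[0];              // "\uD83D" — half a character
s.charCodeAt(0);   // 0xD83D — the high surrogate
s.codePointAt(0);  // 0x1F496 — the actual scalar
[...s].length;     // 2 — iteration walks by code point
```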
> This is somewhat of an unfortunate tech debt thing as I understand, and it was made this way mostly because of JavaScript, which doesn’t work with UTF-8 natively. But this means you need to be extra careful with the indexes in most languages.
I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well? If it were indexed by UTF-16 code unit, I’d agree, that’s bad tech debt; but that’s not the case here.
Bluesky made the decision to go all in on UTF-8 here <https://docs.bsky.app/docs/advanced-guides/post-richtext#tex...>. After all, the strings are being stored and transferred in UTF-8, and UTF-8 is increasingly the tool of choice, while UTF-16 is increasingly reviled: almost nothing new has chosen it for twenty years, and nothing major for ten; it's all strictly legacy. Hugely popular legacy, sure, but legacy.
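Putting it together, a sketch of the facet for the "Cześć, #Bluesky!" example, assuming the record shape from that richtext guide (where `app.bsky.richtext.facet#tag` is the hashtag feature type and `tag` omits the leading #):

```javascript
const text = "Cześć, #Bluesky!";
const encoder = new TextEncoder();

// Byte offsets of "#Bluesky": encode the prefix to find where it starts.
const byteStart = encoder.encode("Cześć, ").length;             // 9 — ś and ć are two bytes each
const byteEnd = byteStart + encoder.encode("#Bluesky").length;  // 17

const facet = {
  index: { byteStart, byteEnd },
  features: [{ $type: "app.bsky.richtext.facet#tag", tag: "Bluesky" }],
};
```

Do this arithmetic in UTF-16 code units or scalars instead of bytes and the facet silently highlights the wrong slice of the post.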