> I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well?
It's not that UTF-8 is because of JavaScript, it's that indexing by bytes rather than by scalars is because of JavaScript. To use UTF-8 in JavaScript, you use TextEncoder/TextDecoder; TextEncoder returns the string as a Uint8Array, which is indexed by bytes.
So if you have a string "Cześć, #Bluesky!" and you want to mark the "#Bluesky" part with a hashtag link facet, the index range is 9...17 (bytes), and not 7...15 (scalars).
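As a concrete check on those numbers, here's a small sketch (plain Node.js, nothing Bluesky-specific; `byteRange` is just a name I made up) that finds a substring and computes its UTF-8 byte offsets by encoding the prefix:

```javascript
const encoder = new TextEncoder();

// Compute the UTF-8 byte offsets of `target` within `text`.
function byteRange(text, target) {
  const start = text.indexOf(target); // index in UTF-16 code units
  if (start === -1) return null;
  // Encode the prefix and the target separately to get byte lengths.
  const byteStart = encoder.encode(text.slice(0, start)).length;
  const byteEnd = byteStart + encoder.encode(target).length;
  return { byteStart, byteEnd };
}

byteRange("Cześć, #Bluesky!", "#Bluesky"); // { byteStart: 9, byteEnd: 17 }
```

The ś and ć each take two bytes in UTF-8, which is where the 9 comes from instead of 7.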
When the encoding is UTF-8 (which it is here), the code unit is the byte.
They called the fields byteStart and byteEnd, but more technically precise (no more or less accurate, but more precise) labels would be utf8CodeUnitStart and utf8CodeUnitEnd.
You may not have seen this interesting article before: https://hsivonen.fi/string-length/. I agree with its assessment that scalars are really pretty useless as a measure, and Python and Ruby are foolish to have chased it at such expense.
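For a quick feel of why "length" is so slippery, here is that article's headline example (the facepalm emoji with a skin tone, built from escapes so it survives copy-paste), measured three ways in JavaScript:

```javascript
// U+1F926 FACE PALM + U+1F3FC skin tone + U+200D ZWJ + U+2642 MALE SIGN + U+FE0F VS16
const s = "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F";

s.length;                           // 7  — UTF-16 code units (JS's native count)
[...s].length;                      // 5  — Unicode scalar values (string iteration walks by code point)
new TextEncoder().encode(s).length; // 17 — UTF-8 bytes
```

One user-perceived character, three different answers; none of them is "1".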
But seriously, I can’t think of any other popular languages that count by scalars or code points—it’s definitely not most languages, it’s a minority, all a very specific sort of language. “Most” encompasses well-formed UTF-8 (e.g. Rust), recommended UTF-8 but it doesn’t actually care (e.g. Go), potentially ill-formed UTF-16 (e.g. JavaScript, Java, .NET), and total-mess (e.g. C, C++).
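"Potentially ill-formed UTF-16" is easy to demonstrate in JavaScript: a string may contain a lone surrogate, which has no UTF-8 encoding, so TextEncoder (per the WHATWG Encoding spec) substitutes U+FFFD when you encode it:

```javascript
const lone = "\uD83D";  // a high surrogate with no pair — still a valid JS string
lone.length;            // 1

const bytes = new TextEncoder().encode(lone);
Array.from(bytes);      // [239, 191, 189] — the UTF-8 bytes of U+FFFD REPLACEMENT CHARACTER
```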
Very few do. Of moderately popular languages, Python is the only one I can think of. Well, Python strings are actually sequences of code points rather than scalars, which is a huge mistake, but provided your strings came from valid Unicode that doesn’t matter.
Languages like Rust and Swift make it fairly easy to access your string by UTF-8 or by scalar.
Languages like Java and JavaScript index by UTF-16 code unit and make anything else at least moderately painful.
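Concretely, the UTF-16 pain in JavaScript: naive indexing lands you inside surrogate pairs, and you have to reach for codePointAt or iteration to get at scalars:

```javascript
const s = "💖!";   // U+1F496 is outside the BMP, so it takes two UTF-16 code units
s.length;          // 3, not 2
s[0];              // "\uD83D" — half a character
s.charCodeAt(0);   // 0xD83D — the high surrogate
s.codePointAt(0);  // 0x1F496 — the actual scalar
[...s].length;     // 2 — iteration walks by code point
```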
> This is somewhat of an unfortunate tech debt thing as I understand, and it was made this way mostly because of JavaScript, which doesn’t work with UTF-8 natively. But this means you need to be extra careful with the indexes in most languages.
I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well? If it were indexed by UTF-16 code unit, I’d agree, that’s bad tech debt; but that’s not the case here.
Bluesky made the decision to go all in on UTF-8 here <https://docs.bsky.app/docs/advanced-guides/post-richtext#tex...>. After all, the strings are being stored and transferred in UTF-8, and UTF-8 is increasingly the tool of choice, while UTF-16 is increasingly reviled: almost nothing new has chosen it for twenty years, and nothing major for ten; it's all strictly legacy. Hugely popular legacy, sure, but legacy.
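Putting it together, a sketch of the facet for the "Cześć, #Bluesky!" example, assuming the record shape from that richtext guide (where `app.bsky.richtext.facet#tag` is the hashtag feature type and `tag` omits the leading #):

```javascript
const text = "Cześć, #Bluesky!";
const encoder = new TextEncoder();

// Byte offsets of "#Bluesky": encode the prefix to find where it starts.
const byteStart = encoder.encode("Cześć, ").length;             // 9 — ś and ć are two bytes each
const byteEnd = byteStart + encoder.encode("#Bluesky").length;  // 17

const facet = {
  index: { byteStart, byteEnd },
  features: [{ $type: "app.bsky.richtext.facet#tag", tag: "Bluesky" }],
};
```

Do this arithmetic in UTF-16 code units or scalars instead of bytes and the facet silently highlights the wrong slice of the post.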