My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you picture the flow of information through the transformer, tokens later in the sequence "absorb" information from the hidden states of earlier tokens, so information effectively flows "down (depth) and to the right (token position)", and the network could plausibly learn a scheme that exploits this asymmetry to encode position.
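To make that concrete, here's a rough sketch of what "NoPE attention" looks like: a standard causal mask and nothing else, with no positional embeddings added anywhere. Names and shapes are purely illustrative, not from any particular codebase.

```python
# Minimal sketch of NoPE-style attention: causal mask, no positional encodings.
import torch
import torch.nn.functional as F

def nope_causal_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x (seq_len, d_model) with a causal mask
    and no positional information added to the inputs."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = (q @ k.T) / d_k**0.5                         # (seq_len, seq_len)
    causal = torch.tril(torch.ones_like(scores)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))   # token i only sees j <= i
    return F.softmax(scores, dim=-1) @ v

# The only asymmetry between positions comes from the mask: token 0 attends to
# a single token, token i attends to i+1 tokens. That is the signal the network
# would have to exploit to infer position.
torch.manual_seed(0)
d = 16
x = torch.randn(8, d)                       # 8 tokens, no positional embeddings
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = nope_causal_attention(x, w_q, w_k, w_v)
print(out.shape)                            # torch.Size([8, 16])
```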
NoPE never really took off in modern architectures. We also haven't seen anyone successfully reproduce the paper's proposed solution to the long-context problem (tuning the scaling factor in the attention softmax).
There was a paper back in December[1] on the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit to the NoPE idea you mentioned of the causal mask providing some amount of position signal, i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD on whether that proves out long term).
This all goes back to my original point, though, that communicating this idea to AI/ML neophytes is challenging. I don't think skipping the concept of positional information actually makes these systems easier to comprehend, since it's critically important to how we model language, but it's also genuinely complicated to explain at the implementation level.
Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this for the closed-source frontier models, but given that all the long-context models in the open-weight domain are explicitly encoding positional data, and that the bulk of recent and past literature corroborates its use, we can be reasonably sure they're using some form of RoPE as well.
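For anyone unfamiliar, the core of RoPE is just rotating each (even, odd) pair of query/key dimensions by an angle proportional to the token's position, so the q·k dot product ends up depending on the relative offset between tokens. A rough sketch below; this is illustrative only, and real implementations (plus YaRN's frequency rescaling on top) differ in details.

```python
# Rough sketch of rotary position embeddings (RoPE), interleaved-pair convention.
import torch

def rope(x, base=10000.0):
    """x: (seq_len, d) with d even. Returns x with a rotary embedding applied."""
    seq_len, d = x.shape
    half = d // 2
    # per-pair frequencies: theta_i = base^(-2i/d)
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) * 2 / d)
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = pos[:, None] * inv_freq[None, :]            # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # even / odd dims
    rot = torch.empty_like(x)
    rot[:, 0::2] = x1 * cos - x2 * sin
    rot[:, 1::2] = x1 * sin + x2 * cos
    return rot

# Because q and k are each rotated by their own positions, q_i . k_j depends
# only on the relative offset i - j, which is what makes context extension
# tricks like scaling these frequencies (NTK/YaRN-style) possible.
q = rope(torch.randn(8, 16))
k = rope(torch.randn(8, 16))
```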
EDIT: there is a recent paper that addresses the sequence modeling problem in another way, but it's somewhat orthogonal to the above since they change the tokenization method entirely: https://arxiv.org/abs/2507.07955