ntonozzi
The most effective architecture for LLMs without tokenization is Meta's MEGABYTE: https://arxiv.org/abs/2305.07185. The trade-off is that it uses two different kinds of model, a global transformer operating over fixed-size patches of bytes and a local transformer predicting the bytes within each patch, instead of a single uniform stack of transformer layers.
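At a high level, the two-stage structure looks something like this. This is a toy sketch only: simple linear maps stand in for the paper's transformer blocks, and all dimensions and the patch size are made-up values, not the ones from the paper.

```python
import numpy as np

P = 4             # patch size in bytes (illustrative)
D_G, D_L = 8, 4   # global / local model widths (illustrative)

def megabyte_forward(byte_ids: np.ndarray) -> np.ndarray:
    """Toy forward pass: a global stage over patches, a local stage over bytes."""
    T = byte_ids.shape[0]
    assert T % P == 0, "sequence length must be a multiple of the patch size"
    n_patches = T // P

    rng = np.random.default_rng(0)
    # Random table stands in for a learned byte embedding (256 possible bytes).
    emb = rng.standard_normal((256, D_L))
    x = emb[byte_ids]                                # (T, D_L)

    # Global stage: each patch's byte embeddings are concatenated into one
    # vector and contextualized (a linear map + tanh stands in for the
    # global transformer).
    patches = x.reshape(n_patches, P * D_L)          # (n_patches, P * D_L)
    W_g = rng.standard_normal((P * D_L, D_G))
    g = np.tanh(patches @ W_g)                       # (n_patches, D_G)

    # Local stage: every byte position is conditioned on its patch's global
    # state, then mapped to next-byte logits (stands in for the local
    # transformer).
    W_proj = rng.standard_normal((D_G, D_L))
    local_in = x + np.repeat(g @ W_proj, P, axis=0)  # (T, D_L)
    W_l = rng.standard_normal((D_L, 256))
    return local_in @ W_l                            # (T, 256) per-byte logits

logits = megabyte_forward(np.arange(16, dtype=np.int64) % 256)
print(logits.shape)  # (16, 256)
```

The point of the split is that the expensive global model runs only once per patch rather than once per byte, which is what makes byte-level sequences tractable at all.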