mhjkl
Afaik most LLM datasets use fastText or something similar to detect the language of the data and to flag spam, plus some additional small language models to score whether the text is "educational" or otherwise desirable. Text is often filtered in rather than filtered out, so anything unusual like this probably won't pass the filter anyway; you don't need to detect it explicitly.
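To make the "filtered in" point concrete, here's a rough sketch of what such a pipeline might look like. Everything in it is an assumption for illustration: `toy_language_id` and `toy_quality_score` are made-up heuristics standing in for a fastText-style classifier and a small quality model, and the thresholds are arbitrary. The point is only that a document is kept when every positive check passes, so odd text gets dropped without any rule that targets it.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical stand-in for a fastText-style language ID model.
# A real pipeline would call a trained classifier here instead.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "it", "that", "for"}


def toy_language_id(text: str) -> Tuple[str, float]:
    """Return (language, confidence) from a crude English stopword ratio."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return ("unknown", 0.0)
    ratio = sum(t in STOPWORDS for t in tokens) / len(tokens)
    if ratio == 0:
        return ("unknown", 0.0)
    return ("en", min(1.0, 4 * ratio))


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a small 'is this desirable text' model:
    the fraction of characters that are letters or whitespace."""
    if not text:
        return 0.0
    clean = sum(ch.isalpha() or ch.isspace() for ch in text)
    return clean / len(text)


def keep_document(text: str, min_lang_conf: float = 0.5,
                  min_quality: float = 0.8) -> bool:
    """Filter IN: keep a document only if every positive check passes."""
    lang, conf = toy_language_id(text)
    quality = toy_quality_score(text)
    return lang == "en" and conf >= min_lang_conf and quality >= min_quality


def filter_in(docs: Iterable[str]) -> List[str]:
    """Return only the documents that pass all checks."""
    return [d for d in docs if keep_document(d)]


if __name__ == "__main__":
    docs = [
        "The mitochondria is the powerhouse of the cell, and it produces energy.",
        "x9$ q!!z ~~ 0xDEAD ## zzzz",
    ]
    for doc in filter_in(docs):
        print(doc)
```

Note there is no rule anywhere that says "reject weird strings". The second document just fails to look like confident English, so it never makes it in.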