Comment by mhjkl - Hacker Neue

mhjkl Nov 24, 2025 parent

Afaik most LLM datasets use fastText or something similar to detect the language of the data and whether it's spam, plus some additional small language models to detect whether text is "educational" or desirable in some other way. Text is often filtered in rather than filtered out: a document is kept only if it passes the classifiers, so anything unusual like this probably won't pass the filter, and you don't need to detect it explicitly.
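To make the "filtered in" idea concrete, here is a rough sketch, not any real dataset's pipeline. Everything in it is hypothetical: `KeepRule`, `filter_in`, and the two toy scorers are stand-ins I made up. In practice the scorers would be something like a fastText language-ID model and a small quality classifier; the thresholds here are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# A scorer maps text to a (label, confidence) pair, e.g. ("en", 0.97).
Scorer = Callable[[str], Tuple[str, float]]


@dataclass
class KeepRule:
    """Hypothetical rule: keep a doc only if the scorer is confident in `want`."""
    scorer: Scorer
    want: str
    min_conf: float

    def passes(self, text: str) -> bool:
        label, conf = self.scorer(text)
        return label == self.want and conf >= self.min_conf


def filter_in(docs: Iterable[str], rules: List[KeepRule]) -> Iterator[str]:
    """Yield only docs that pass every rule. Nothing is explicitly rejected:
    odd text simply fails to score as wanted and is dropped."""
    for doc in docs:
        if all(rule.passes(doc) for rule in rules):
            yield doc


def toy_lang_id(text: str) -> Tuple[str, float]:
    """Toy stand-in for a language-ID model (assumption, not fastText):
    confidence is the share of non-space characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return ("unknown", 0.0)
    frac = sum(c.isascii() and c.isalpha() for c in chars) / len(chars)
    return ("en", frac)


def toy_quality(text: str) -> Tuple[str, float]:
    """Toy stand-in for a quality classifier: 'keep' if at least 5 words."""
    return ("keep", 1.0) if len(text.split()) >= 5 else ("drop", 1.0)


if __name__ == "__main__":
    rules = [KeepRule(toy_lang_id, "en", 0.8), KeepRule(toy_quality, "keep", 0.5)]
    docs = [
        "The quick brown fox jumps over the lazy dog",
        "∆∆∆ ≋≋≋ ⟁⟁⟁ zalgo",
        "buy now",
    ]
    for kept in filter_in(docs, rules):
        print(kept)
```

Note that there is no rule anywhere that says "reject text with weird symbols". The second document is dropped only because it never clears the language-ID threshold, which is the point of filtering in.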

