I can imagine that most LLMs, if you ask them to find a security vulnerability in a given piece of code, will make something up out of thin air. I've (mistakenly) sent valid code along with an unrelated error, and to this day I get nonsense "fixes" for that error.
This alignment problem, between telling the user what they want to hear (e.g. a security report, flattering responses) and pushing back against them, seems like a major limitation on the effectiveness of such systems.