Oh, so you didn't run the repo and instead remembered something you read once that looked like it matched. This contribution is meaningless.
The simplest way to resolve any doubt is to run the code. Every result in the paper comes from reproducible scripts in the repo, not from speculative reasoning or LLM-assisted invention.
Re your EDIT: the first thing it suggested is actually very similar to ensembles in meteorology. I find myself doing that often when something is extremely important. It just feels natural to cross-check with other models or with reality. The disclaimer says it may make mistakes, after all...
Like you don't predict the weather or a hurricane track with a single model. The NHC uses many.
It's still probabilistic, but if multiple models are independently in agreement, then it's at least worth investigating further.
EDIT: Found a closer description ("Your LLM-assisted scientific breakthrough probably isn't real"): https://www.lesswrong.com/posts/rarcxjGp47dcHftCP/your-llm-a...