Ph.D. in neuroscience here. Programmer by trade. This is not true. Less you know about most peer revies is better.
The better peer reviews are also not this 'thorough' and no one expects reviewers to read or even check references. Unless they are citing something they are familiar with and you are using it wrong then they will likely complain. Or they find some unknown citations very relevant to their work, they will read.
I don't have a great analogy to draw here. peer review is usually a thankless and unpaid work so there is unlikely to be any motivation for fraud detection unless it somehow affects your work.
Checking references can be useful when you are not familiar with the topic (but must review the paper anyway). In many conference proceedings that I have reviewed for, many if not most citations were redacted so as to keep the author anonymous (citations to the author's prior work or that of their colleagues).
LLMs could be used to find prior work anyway, today.
Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.
So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.
As a PR reviewer I frequently pull down the code and run it. Especially if I'm suggesting changes because I want to make sure my suggestion is correct.
Do other PR reviewers not do this?
E.g. you can imagine that if I'm reviewing changes in authentication logic, I'm obviously going to put a lot more effort into validation than if I'm reviewing a container and wondering if it would be faster as a hashtable instead of a tree.
> because I want to make sure my suggestion is correct.
In this case I would just ask "have you already also tried X" which is much faster than pulling their code, implementing your suggestion, and waiting for a build and test to run.
And even then, what you're describing isn't review per se, it's replication. In principle there are entire journals that one can submit replication reports to, which count as actual peer reviewable publications in themselves. So one needs to be pragmatic with what is expected from a peer review (especially given the imbalance between resources invested to create one versus the lack of resources offered and lack of any meaningful reward)
Machine learning conferences generally encourage (anonymized) submission of code. However, that still doesn't mean that replication is easy. Even if the data is also available, replication of results might require impractical levels of compute power; it's not realistic to ask a peer reviewer to pony up for a cloud account to reproduce even medium-scale results.
No, because this is usually a waste of time, because CI enforces that the code and the tests can run at submission time. If your CI isn't doing it, you should put some work in to configure it.
If you regularly have to do this, your codebase should probably have more tests. If you don't trust the author, you should ask them to include test cases for whatever it is that you are concerned about.
Reviewers wanting to pull and run many PRs makes me think your automated tests need improvement.
So running it myself involves judging other risks, much higher-level ones than bad unicode characters, like the GUI button being in the wrong place.
Some do, many, (like peer reviewers), are unable to consider the consequences of their negligence.
But it's always a welcome reminder that some people care about doing good work. That's easy to forget browsing HN, so I appreciate the reminder :)
No it's not. I think you're trying to make a different point, because you're using an example of a specific deliberate malicious way to hide a token error that prevents compilation, but is visually similar.
> and you as a code reviewer are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.
What weird world are you living in where you don't have CI. Also, it's pretty common I'll test code locally when reviewing something more complex, more complex, or more important, if I don't have CI.
> Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.
I don't, because it won't compile. Not because I assume good faith. References and citations are similar to introducing dependencies. We're talking about completely fabricated deps. e.g. This engineer went on npm and grabbed the first package that said left-pad but it's actually a crypto miner. We're not talking about a citation missing a page number, or publication year. We're talking about something that's completely incorrect, being represented as relevant.
> So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.
I would never miss this, because the important thing is code needs to compile. If it doesn't compile, it doesn't reach the master branch. Peer review of a paper doesn't have CI, I'm aware, but it's also not vulnerable to syntax errors like that. A paper with a fake semicolon isn't meaningfully different, so this analogy doesn't map to the fraud I'm commenting on.
breaking the analogy beyond the point where it is useful by introducing non-generalising specifics is not a useful argument. Otherwise I can counter your more specific non-generalising analogy by introducing little green aliens sabotaging your imaginary CI with the same ease and effect.
But I agree, because I'd rather discuss the pragmatics and not bicker over the semantics about an analogy.
Introducing a token error, is different from plagiarism, no? Someone wrote code that can't compile, is different from someone "stealing" proprietary code from some company, and contributing it to some FOSS repo?
In order to assume good faith, you also need to assume the author is the origin. But that's clearly not the case. The origin is from somewhere else, and the author that put their name on the paper didn't verify it, and didn't credit it.
The point is what is expected as reasonable review before one can "sign their name on it".
"Lazy" (or possibly malicious) authors will always have incentives to cut corners as long as no mechanisms exist to reject (or even penalise) the paper on submission automatically. Which would be the equivalent of a "compiler error" in the code analogy.
Effectively the point is, in the absence of such tools, the reviewer can only reasonably be expected to "look over the paper" for high-level issues; catching such low-level issues via manual checks by reviewers has massively diminishing returns for the extra effort involved.
So I don't think the conference shaming the reviewers here in the absence of providing such tooling is appropriate.
Then you can build a true hierarchy of citation dependencies, checked 'statically', and have better indications of impact if a fundamental truth is disproven, ...
Could you provide a proof of concept paper for that sort of thing? Not a toy example, an actual example, derived from messy real-world data, in a non-trivial[1] field?
---
[1] Any field is non-trivial when you get deep enough into it.
totally agree with your thinking here, we can't just give this to an LLM, because of the need to have industry-specific standards for what is a hallucination / match, and how to do the search
One could submit their bibtex files and expect bibtex citations to be verifiable using a low level checker.
Worst case scenario if your bibtex citation was a variant of one in the checker database you'd be asked to correct it to match the canonical version.
However, as others here have stated, hallucinated "citations" are actually the lesser problem. Citing irrelevant papers based on a fly-by reference is a much harder problem; this was present even before LLMs, but this has now become far worse with LLMs.
the tool gptzero used in the article also detects if the citation supports the claim too, if you scroll to "cited information accuracy" here: https://app.gptzero.me/documents/1641652a-c598-453f-9c94-e0b...
this is still in beta because its a much harder problem for sure, since its hard to determine if a 40 page paper supports a claims (if the paper claims X is computationally intractable, does that mean algorithms to compute approximate X are slow?)
1. A patch is self-contained and applies to a codebase you have just as much access to as the author. A paper, on the other hand, is just the tip of the iceberg of research work, especially if there is some experiment or data collection involved. The reviewer does not have access to, say, videos of how the data was collected (and even if they did, they don't have the time to review all of that material).
2. The software is also self-contained. That's "prodcution". But a scientific paper does not necessarily aim to represent scientific consensus, but a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.
Given the repeatability crisis I keep reading about, maybe something should change?
> 2. The software is also self-contained. That's "prodcution". But a scientific paper does not necessarily aim to represent scientific consensus, but a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.
This is a much, MUCH stronger point. I would have lead with this because the contrast between this assertion, and my comparison to prod is night and day. The rules for prod are different from the rules of scientific consensus. I regret losing sight of that.
The replication crisis — assuming that it is actually a crisis — is not really solvable with peer review. If I'm reviewing a psychology paper presenting the results of an experiment, I am not able to re-conduct the entire experiment as presented by the authors, which would require completely changing my lab, recruiting and paying participants, and training students & staff.
Even if I did this, and came to a different result than the original paper, what does it mean? Maybe I did something wrong in the replication, maybe the result is only valid for certain populations, maybe inherent statistical uncertainty means we just get different results.
Again, the replication crisis — such that it exists — is not the result of peer review.
Even if peer review is as rigorous as code reviewed (the former which is usually unpaid), we all know that reviewed code still has bugs, and a programmer would be nuts to go around saying "this code is reviewed by experts, we can assume it's bug free, right?"
But there are too many people who are just assuming peer reviewed articles means they're somehow automatically correct.
Correct. Peer review is a minimal and necessary but not sufficient step.
I mean, being peer reviewed is a signal of a paper's quality, but in the hands of an expert in that domain it's not a very valuable signal, because they can just read the paper themselves, and figure out whether it's legit. So instead of having "experts" try to explain a paper and commenting on whether it's peer reviewed or not, I think the better practice is to have said expert say "I read the paper and it's legit", or "I read the paper and it's nonsense".
IMHO the reason they make note of whether it's peer reviewed is because they don't know enough to make the judgement themselves. And the fallback is to trust a couple anonymous reviewers attest to the quality of a paper! If you think of it that way, using this signal to vet the quality of a publication to the lay public isn't really a good idea.
Yeah its insane the workload reviewers are faced with + being an author who gets a review from a novice
No.
Modern peer review is “how can I do minimum possible work so I can write ‘ICLR Reviewer 2025’ on my personal website”
I don't know, I still think this describes most of the reviews I've seen
I just hope most devs that do this know better than to admit to it.
I've always assumed peer review is similar to diff review. Where I'm willing to sign my name onto the work of others. If I approve a diff/pr and it takes down prod. It's just as much my fault, no?
> They are also assuming good faith.
I can only relate this to code review, but assuming good faith means you assume they didn't try to introduce a bug by adding this dependency. But I would should still check to make sure this new dep isn't some typosquatted package. That's the rigor I'm responsible for.