> I’m not sure an LLM can really capture project-specific context yet from a single PR diff.
We had an even more expensive approach that cloned the repo into a VM and prompted Codex to explore the codebase and run code before returning the heatmap data structure. We decided against it for now due to latency and cost, but I think we'll revisit it to help the LLM get project context.
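For the curious, here's a drastically simplified, non-agentic sketch of that general shape: hand the model the diff plus a bit of repo context and ask for a per-file risk map back. It's not our actual pipeline (no VM, no code execution, and the model name, prompt, and JSON shape are all placeholders):

```python
import json
import subprocess
from pathlib import Path
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

def run_git(repo: Path, *args: str) -> str:
    return subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    ).stdout

def pr_heatmap(repo: Path, base: str = "origin/main") -> dict:
    """Ask the model for a per-file risk score for the current branch's diff."""
    file_tree = "\n".join(run_git(repo, "ls-files").splitlines()[:300])
    diff = run_git(repo, "diff", f"{base}...HEAD")
    prompt = (
        "You are reviewing a pull request. Using the repo file list for context "
        "and the diff below, return a JSON object mapping each changed file to a "
        "risk score between 0 and 1.\n\n"
        f"FILES:\n{file_tree}\n\nDIFF:\n{diff}"
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```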
Distillation should help a bit with cost, but I haven't experimented enough to have a definitive answer. Excited to play around with it though!
> which parts of the code change most often or correlate with past bugs
I can think of a way to do the correlation, but it would require LLMs. Maybe I'm missing a simpler approach? But I agree that conditioning on past bugs would be great.
As for interactive reviews, one workflow I’ve found surprisingly useful is letting Claude Code simulate a conversation between two developers pair-programming through the PR. It’s not perfect, but in practice the dialogue and clarifying questions it generates often give me more insight than a single-shot LLM summary. You might find it an interesting pattern to experiment with once you revisit the more context-aware approaches.
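A rough sketch of the pattern, assuming the GitHub CLI for fetching the diff and Claude Code's non-interactive print mode (`claude -p`); the PR number and prompt wording are placeholders, so treat it as an illustration rather than a recipe:

```python
import subprocess

def pair_review(pr_number: str) -> str:
    """Feed a PR diff to Claude Code and get a simulated pair-programming dialogue back."""
    diff = subprocess.run(
        ["gh", "pr", "diff", pr_number],
        capture_output=True, text=True, check=True,
    ).stdout
    prompt = (
        "Role-play two developers pair-programming through this PR diff: one walks "
        "through it hunk by hunk, the other asks clarifying questions and pushes back "
        "on risky assumptions. End with the open questions for the PR author.\n\n"
        + diff
    )
    # `claude -p` runs Claude Code non-interactively and prints the result.
    return subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    ).stdout

print(pair_review("1234"))  # hypothetical PR number
```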
At first I thought this too, but now I doubt it's a good heuristic. Hotspots are probably where people are careful and/or look closely anyway. If I were to guess, regressions are less likely to occur in "hotspots".
But this is just a hunch. There are plenty of well-reviewed open source projects with good bug-report histories; it would be interesting if someone tested it.
I mean, these tools are fine. But let's be on the same page that they can only address a subclass of problems.
I’m not sure an LLM can really capture project-specific context yet from a single PR diff.
Honestly, a simple data-driven heatmap showing which parts of the code change most often or correlate with past bugs would probably give reviewers more trustworthy signals.
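Even something as crude as the following sketch would be a start. It counts per-file churn from git history and, as a stand-in for real bug labels, greps commit messages for fix-like keywords; the keywords, time window, and ranking are arbitrary choices on my part:

```python
import subprocess
from collections import Counter

def changed_files(*log_args):
    """Count how often each file appears in commits matched by the git-log args."""
    out = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:", *log_args],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line.strip() for line in out.splitlines() if line.strip())

# Overall churn (the one-year window is arbitrary).
churn = changed_files("--since=1.year")

# Churn restricted to commits whose message looks like a bug fix.
# The keyword grep is a crude stand-in for real bug labels.
bug_churn = changed_files(
    "--since=1.year", "-i", "--grep=fix", "--grep=bug", "--grep=regression"
)

# Rank: most-changed files first, with their bug-fix touches alongside.
for path, n in churn.most_common(25):
    print(f"{path:<60} changed {n:>3}x   in bug-fix-ish commits {bug_churn[path]:>3}x")
```

Anything that sits near the top of both counts is probably worth a closer look during review.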