Love this as a real-world benchmark!

How much prompt iteration did you do? I've noticed when building real-world agentic apps that small prompt tweaks can make a huge difference in behavior (re: the reward hacking vs. hallucinating). Would love to learn more about the approach here.


Hey, I'm a member of the benchmark team. We iterated on the prompts based on observed model behaviors. A few key examples (see the sketch after these for how the pieces might fit into a system prompt):

Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.

Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.

Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.
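For what it's worth, here's a rough sketch of how those three pieces might be assembled into a single system prompt. Everything here (the helper name, the schema, the exact wording) is illustrative, not the actual benchmark prompt:

    # Hypothetical sketch: front-load schema, anti-gaming constraints, and
    # domain context into one system prompt. None of these names or strings
    # come from the benchmark itself.

    SCHEMA_DDL = """
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        posted_at TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        description TEXT,
        category TEXT
    );
    """

    ANTI_GAMING_RULES = (
        "Do not game the reconciliation checks: never insert balancing "
        "entries or adjust totals just to make a check pass. If the books "
        "do not reconcile, report the discrepancy instead."
    )

    COMPANY_CONTEXT = (
        "The company is an early-stage, YC-backed startup, so expect "
        "startup-specific items such as SAFE notes."
    )

    def build_system_prompt() -> str:
        # Provide the full data model upfront so the model does not burn
        # tokens introspecting the database with trial-and-error SQL.
        return "\n\n".join([
            "You are a bookkeeping agent working against a SQL database.",
            "Database schema (provided upfront; no introspection needed):",
            SCHEMA_DDL.strip(),
            ANTI_GAMING_RULES,
            COMPANY_CONTEXT,
        ])

    print(build_system_prompt())

The main design choice is front-loading everything static (schema, constraints, company context) rather than letting the agent rediscover it at runtime; as noted above, the anti-gaming instructions helped at first but models eventually ignored them.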
