samuelknight
103 karma
Founder & CTO at Vulnetic Inc. https://vulnetic.ai/

  1. These complaints are about technical limitations that will go away for codebase-sized problems as inference cost continues its collapse and context windows grow.

    There are literally hundreds of engineering improvements that we will see along the way, like an intelligent replacement for compacting to deal with diff explosion, more raw memory availability and dedicated inference hardware, models that can actually handle >1M context windows without attention loss, and so on.

  2. You're Absolutely Right!
  3. Switching from my 8-core Ryzen mini PC to an 8-core Ryzen desktop makes my unit tests run way faster. TDP limits can tip you off to very different performance envelopes in otherwise similarly specced CPUs.
  4. It's good to be skeptical of new ideas as long as you don't box yourself in with dogmatism. If you're young you do this by looking at the world with fresh eyes. If you are experienced you do it by identifying assumptions and testing them.
  5. This is an interesting experiment that we can summarize as "I gave a smart model a bad objective", with the key result at the end:

    "...oh and the app still works, there's no new features, and just a few new bugs."

    Nobody thinks that doing 200 improvement passes on a functioning codebase is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility of the principal engineer, but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see different behavior if the prompt were changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.

    In my experience with CC I get great results when I ask an open-ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.

  6. My startup builds agents for penetration testing, and this is the bet we have been making since models started getting good at coding over a year ago. There was a huge jump in capability from Sonnet 4 to Sonnet 4.5. We are still internally testing Opus 4.5, which is the first version of Opus priced low enough to use in production. It's very clever, and we are redesigning our benchmark systems because it's saturating the test cases.
  7. I use .md files to tell the model about my development workflow. Along the lines of "here's how you lint", "do this to re-generate the API", "this is how you run unit tests", "the sister repositories are cloned here and this is what they are for".

    One may argue that these belong in a README.md, but these markdowns are meant to be more streamlined for context, and a top-level file like the README is not the appropriate place for an imperative one-liner meant to correct model behavior.
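
    For illustration, one of these workflow files might look something like this (the commands and paths are made up):

      # Dev workflow
      - Lint: run `make lint` before committing
      - Re-generate the API client: `./scripts/gen_api.sh`
      - Unit tests: `pytest tests/ -x` from the repo root
      - Sister repos: ../shared-schemas (API types), ../agent-core (agent runtime)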

  8. "Gemini 3 Pro Preview" is in Vertex
  9. I look at LLMs with an engineering mindset. It is an intelligence black box that goes in a toolbox with the old classical algorithms and frameworks. In order to use it in a solution I need to figure out:

    1) Whether I can give it information in a compatible and cost-effective way

    2) Whether the model is likely to produce useful output

    I had used language models for years before LLMs, such as part-of-speech classifiers in the Python NLTK framework.
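
    For example, a classical tagger in NLTK (assuming the tokenizer and tagger data have been downloaded):

      import nltk

      # one-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
      tokens = nltk.word_tokenize("The agent escalated privileges quickly.")
      print(nltk.pos_tag(tokens))
      # [('The', 'DT'), ('agent', 'NN'), ('escalated', 'VBD'), ...]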

  10. My startup is building agents for automating pentesting. We started experimenting with Llama 3.1 last year. Pentesting with agents started getting good around Sonnet 3.5 v1.

    The switch from Sonnet 4 to 4.5 was a huge step change. One of our beta testers ran our agent on a production Active Directory network with ~500 IPs and it was able to privilege escalate to DA within an hour. I've seen it one-shot scripts to exploit business logic vulnerabilities. It will slurp down JS from websites and sift through it for API endpoints, then run a Python server to perform client-side analysis. It understands all of the common pentesting tools with minor guard rails. When it needs an email to authenticate it will use one of those 10-minute fake email websites with curl and playwright. I am conservative about my predictions, but here is what we can learn from this incident and what I think is inevitably next:

    Chinese attackers used Anthropic (a hostile and expensive platform) because American SOTA is still ahead of Chinese models. Open-weight models are about 6-9 months behind closed SOTA. So by mid-2026 hackers will have the capability to secretly host open-weight models on generic cloud hardware and relay agentic attacks through botnets to any point on the internet.

    There is an arms race between blackhats and private companies to build the best hacking agents, and we are running out of things the agent CAN'T do. The major change from Claude 4 to Claude 4.5 was the ability to avoid rate limiting and WAFs during web pentests, and we think that the next step is AV evasion. When Claude 4.7 comes out, if it is able to effectively evade antivirus, companies are in for a rude awakening. Just my two cents.

  11. That's what I thought when I started, and it functions so poorly that I think they should remove it from their docs. You can enforce a schema by creating a tool definition with JSON in the exact shape you want the output, then setting "tool_choice" to "any". They have a picture that helps.

    https://docs.claude.com/en/docs/agents-and-tools/tool-use/im...

    Unfortunately it doesn't support the full JSON Schema spec. You can't use unions or other things you would expect. It's manageable, since you can just create another tool for it to choose from that fits the other case.
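
    A minimal sketch with the anthropic Python SDK (the model id and tool name here are illustrative):

      import anthropic

      client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

      resp = client.messages.create(
          model="claude-sonnet-4-5",  # illustrative model id
          max_tokens=1024,
          tools=[{
              "name": "record_finding",  # hypothetical tool; it exists only to carry the schema
              "description": "Record a structured finding.",
              "input_schema": {
                  "type": "object",
                  "properties": {
                      "title": {"type": "string"},
                      "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                  },
                  "required": ["title", "severity"],
              },
          }],
          tool_choice={"type": "any"},  # forces a tool call, i.e. schema-shaped output
          messages=[{"role": "user", "content": "Summarize this finding: ..."}],
      )

      # the answer arrives as the tool call's input, matching the schema
      block = next(b for b in resp.content if b.type == "tool_use")
      print(block.input)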

  12. LLM 'neurons' are not single-input/single-output functions. Most 'neurons' are mat-vec computations that combine the weighted outputs of dozens or hundreds of prior units.
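
    In numpy terms, one layer looks something like this (the sizes are illustrative):

      import numpy as np

      d_in, d_out = 512, 2048
      W = np.random.randn(d_out, d_in)  # learned weights
      x = np.random.randn(d_in)         # activations from the prior layer
      y = np.maximum(W @ x, 0.0)        # mat-vec product, then a nonlinearity

      # a single output "neuron" y[0] already mixes all 512 inputs:
      assert np.isclose(y[0], max(W[0] @ x, 0.0))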

    In our lane the only important question to ask is "Of what value are the tokens these models output?", not "How closely can we emulate an organic brain?"

    Regarding the article, I disagree with the thesis that AGI research is a waste. AGI is the moonshot goal. It's what motivated the fairly expensive experiment that produced the GPT models, and we can point to all sorts of other harebrained goals that ended up making revolutionary changes.

  13. You can address the issue by putting the report and the codebase in a sandbox with an agent that tries to reproduce it. If it can't reproduce the bug, that should be a strike against the reporter. OSS projects should absolutely ban accounts that repeatedly create reports of such low quality that they can't be reproduced. IMO the HackerOne reputation mechanism is a good idea because it incentivizes users who operate in good faith and can serially produce findings.
  14. Microcenter does this too. When you are at the checkout you can see the cashier using a DOS interface to ring up the items.
  15. (1) JSON requires lots of escape characters that mangle the strings (plus hex escapes), and (2) it's much easier for model attention to track where a semantic block begins and ends when it's wrapped in the name of that section:

      <instructions>
      ...
      ...
      </instructions>

    can be much easier than

      {
        "instructions": "...\n...\n"
      }

    especially when there are newlines, quotes and unicode
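
    You can see the mangling directly in Python:

      import json

      text = 'line one\nhe said "hi" and π ≈ 3.14'

      # JSON escapes newlines, quotes and (by default) non-ASCII:
      print(json.dumps({"instructions": text}))
      # {"instructions": "line one\nhe said \"hi\" and \u03c0 \u2248 3.14"}

      # XML-style tags leave the payload untouched:
      print(f"<instructions>\n{text}\n</instructions>")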

  16. 90% as good as Sonnet 4 or 4.5? OpenRouter just started reporting, and it's saying Haiku is about 2x as fast (125 tps vs 60 tps) with 2-3x lower latency (1 s vs 2-3 s).
  17. Sonnet 4.5 is an excellent model for my startup's use case. Chatting with Haiku, it looks promising too, and it may be a great drop-in replacement for some of the inference tasks that have a lot of input tokens but don't require 4.5-level intelligence.
  18. No. I can prototype in 20 mins things that would have taken me a day before.
  19. There is a deep literature on this in the High Performance Computing (HPC) field, where researchers traditionally needed to design simulations to run on hundreds to thousands of nodes with up to hundreds of CPU threads each. Computation can be expressed as a dependency graph at the function or even variable level (depending on how granular you can make your threads). Languages built on top of LLVM, or interpreters that expose an AST, can get you a long way there.
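
    A toy sketch of the idea in Python (a real HPC runtime does this with far more sophistication):

      from concurrent.futures import ThreadPoolExecutor

      # node -> (function, names of its dependencies)
      graph = {
          "a": (lambda: 2, []),
          "b": (lambda: 3, []),
          "c": (lambda a, b: a * b, ["a", "b"]),
          "d": (lambda c: c + 1, ["c"]),
      }

      results, pending = {}, dict(graph)
      with ThreadPoolExecutor() as pool:
          while pending:
              # every node whose dependencies are satisfied can run in parallel
              ready = [n for n, (_, deps) in pending.items()
                       if all(d in results for d in deps)]
              futures = {n: pool.submit(pending[n][0], *(results[d] for d in pending[n][1]))
                         for n in ready}
              for n, f in futures.items():
                  results[n] = f.result()
                  del pending[n]

      print(results["d"])  # (2 * 3) + 1 = 7
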
  20. I disagree with this model because it assumes processing occurs at a point and memory is (optimally) distributed through the space around it in every direction, in analogy to a von Neumann CPU architecture. However, it is entirely possible to co-locate compute with memory. For example, Samsung has a technology called PIM (Processing-in-Memory) where simple compute units are inserted inside HBM memory layers. Algorithms that can take advantage of this run much faster and at much lower power because they skip the bus entirely. More importantly, the compute scales in proportion to the memory size/space.
