ehsanu1
1,342 karma
ehsanul@ehsanul.com

I'm actually ehsanul (http://news.ycombinator.com/user?id=ehsanul) here on HN, but I switched Google accounts.


  1. But I doubt you can opt in to them training on that data coming in via OpenCode.
  2. I understand them not wanting to allow non-coding agents to use the subscription, but why specifically block another coding agent? Is the value Anthropic gets from users specifically using Claude Code that high? Is it about the training data opt-ins?
  3. What exactly do you mean by custom tools here? Just CLI tools accessible to the agent?
  4. Doing something merely requires I/O. Brains wouldn't be doing much without that. A sufficiently accurate simulation of a fundamentally computational process is really just the same process.
  5. The DB specifically, or the concept of event sourcing? Event sourcing is not a new approach, and it has a lot of similarities with Temporal's approach, though Temporal events are not necessarily business events, and deterministic event replay is required with Temporal. In the general case of event sourcing, arbitrary processing might be done on the event stream to produce some final state, or to do whatever needs to happen for your use case. As long as you're persisting the events and using them as the basis for your business logic and state, you're doing event sourcing.

    I don't know anything about this specific DB, though, if that was what you were wondering about; that's more of an implementation-level detail. The Temporal server just uses regular MySQL and supports multiple storage backends.
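
    To make the concept concrete, here's a minimal event-sourcing sketch (event types and amounts made up): state is never stored directly, it's derived by folding over the persisted event log.

    ```python
    # Minimal event-sourcing sketch: the append-only event log is the source
    # of truth; current state is derived by replaying (folding over) events.
    from dataclasses import dataclass

    @dataclass
    class Deposited:
        amount: int

    @dataclass
    class Withdrew:
        amount: int

    def apply(balance: int, event) -> int:
        # Pure transition function: old state + event -> new state.
        if isinstance(event, Deposited):
            return balance + event.amount
        if isinstance(event, Withdrew):
            return balance - event.amount
        return balance

    # The persisted log (in practice, an append-only table or stream).
    log = [Deposited(100), Withdrew(30), Deposited(5)]

    # Rebuild current state at any time by replaying from scratch.
    balance = 0
    for event in log:
        balance = apply(balance, event)
    print(balance)  # 75
    ```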

  6. This seems like a good template to generate synthetic data, with positive/negative examples, allowing an embedding model to be aligned more semantically to underlying concepts.

    Anyways, I'd hope reranking models do better; have you tried those?
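
    As a sketch of what I mean, assuming the classic sentence-transformers training API and entirely made-up example texts: generate (anchor, positive, negative) triplets from the template, then fine-tune the embedding model on them.

    ```python
    # Sketch: fine-tune an embedding model on synthetic triplets generated
    # from a template. Texts here are invented for illustration.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each example is (anchor, positive, negative).
    train_examples = [
        InputExample(texts=[
            "reset a user's password",
            "help, I can't log in to my account",  # same underlying concept
            "how do I delete my account?",         # superficially similar, different concept
        ]),
    ]

    loader = DataLoader(train_examples, shuffle=True, batch_size=16)
    loss = losses.TripletLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
    ```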

  7. Do you assign different responsibilities to different LSP servers when there are multiple, I suppose?
  8. Using a Research->Plan->Implement flow is orthogonal, though I notice parts of it do exist as skills too. But you sometimes need to do other things as well, e.g. debugging in the course of implementing, or specific techniques to improve brainstorming/researching.

    Some of these skills are probably better as programmed workflows that the LLM is forced to go through, to improve reliability/consistency, rather than using English to guide the LLM and trusting it to follow the prescribed set of steps; that's what I've found in my own agents. Some mix of LLMs (choosing skills, executing the fuzzy parts of them) and plain code (orchestrating the skills) seems like the best bet to me, and is what I'm pursuing.
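
    A rough sketch of that split (all names hypothetical): plain code owns the control flow, so the LLM can't skip or reorder stages, and it is only called for the fuzzy parts.

    ```python
    # Hypothetical sketch: deterministic code enforces the workflow;
    # the LLM only fills in the fuzzy steps within each stage.
    def llm(prompt: str) -> str:
        # Stub: swap in a real model call (OpenAI, Anthropic, local, ...).
        return f"<llm output for: {prompt[:40]}>"

    def research(task: str) -> str:
        return llm(f"List the files, APIs and constraints relevant to: {task}")

    def plan(task: str, notes: str) -> str:
        return llm(f"Write a step-by-step plan for: {task}\nNotes:\n{notes}")

    def implement(task: str, steps: str) -> str:
        return llm(f"Carry out this plan for {task}:\n{steps}")

    def run(task: str) -> str:
        # Orchestration is ordinary code: stages always run, in this order.
        notes = research(task)
        steps = plan(task, notes)
        return implement(task, steps)

    print(run("fix the flaky integration test"))
    ```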

  9. Seeing some stats would be fun. I wonder what the amount of data is here. And the distribution would be interesting too, especially since some pages are archived at multiple points in time, and pages have been getting heavier these days.
  10. I see no conflict between AGPL and SaaS: https://opensource.stackexchange.com/a/12988
  11. Are these actually different models vs just different names from the open weights releases?
  12. I'm reading: the difference is that this is an agent-as-a-judge rather than an LLM-as-a-judge, paired with more structured judging parameters. Is that right? Is the agent just a loop over each criterion, or is it also somehow reflecting on its own judging, or similar?
  13. I believe that's exactly the point: it's too easy to violate constraints like not allowing multiple mutable references. Unsafe is meant for cases where the validity of the code is difficult to prove with rust's lifetime analysis, but can be abused to do much more than that.
  14. It's hard to attribute PR merge rate to higher tool quality here. Another likely factor is the complexity of the task. Just looking at the first PR I saw from the GitHub search for Codex PRs, it was a one-line change that any tool, even years ago, could have easily accomplished: https://github.com/maruyamamasaya/yasukaribike/pull/20/files
  15. Where I work, our legal department requires making use of LLMs only through our own contractual relationships with model providers. Given that, BYOK is table stakes for me at least.

    LiteLLM is what we use internally, so we can support any LLM backend with any open-source tool, and create virtual keys for each developer to monitor and manage usage limits, etc.
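
    Concretely, since the LiteLLM proxy exposes an OpenAI-compatible endpoint, any client that takes a base URL and key can point at it, and the virtual key is what carries each developer's budget and rate limits. Roughly (URL, key, and model alias made up):

    ```python
    from openai import OpenAI

    # Any OpenAI-compatible client can talk to the LiteLLM proxy. The api_key
    # here is a per-developer LiteLLM "virtual key" with its own limits.
    client = OpenAI(
        base_url="https://llm-proxy.internal.example.com",
        api_key="sk-litellm-virtual-key",
    )

    resp = client.chat.completions.create(
        model="claude-sonnet",  # alias mapped to a real backend in the proxy config
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)
    ```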

  16. There seem to be a couple of field-specific journals of negative results for similar purposes. It seems like there should be value in citing negative results to inform current research. Perhaps if there were more journals dedicated to this, or a single one not limited to specific fields, there would still be some incentive to publish there, provided the effort required was low enough (another area where AI might be applied: writing it up).
  17. It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.
  18. IMO just a rolling message history works for only the simplest of AI tools. Useful agents will tend towards much more complex state that extends into specific verticals/domains.
  19. Essentially, you don't need to think about time and space. You just write more or less normal looking code, using the Temporal SDK. Except it actually can resume from arbitrarily long pauses, waiting as long as it needs to for some signal, without any special effort beyond using the SDK. You also automatically get great observability into all running workflows, seeing inputs and outputs at each step, etc.

    The cost of this is that you have to be careful to make new versions of a workflow backwards compatible, and the backcompat requirements are easy to misunderstand and easy to mess up. There's also additional infra you need to run: the Temporal server. Temporal Cloud isn't cheap at scale, but it does reduce that burden.
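
    A rough sketch of what this looks like with Temporal's Python SDK (simplified; the workflow, activity, and signal names are made up, and worker/activity registration is omitted):

    ```python
    from datetime import timedelta
    from temporalio import workflow

    @workflow.defn
    class ApprovalWorkflow:
        def __init__(self) -> None:
            self.approved = False

        @workflow.signal
        def approve(self) -> None:
            self.approved = True

        @workflow.run
        async def run(self, request_id: str) -> str:
            # Looks like normal code, but each step is durably recorded:
            # the process can crash or be redeployed and resume right here.
            await workflow.execute_activity(
                "notify_reviewers",  # activity implemented/registered elsewhere
                request_id,
                start_to_close_timeout=timedelta(minutes=5),
            )
            # This can wait for days or months; nothing has to stay alive.
            await workflow.wait_condition(lambda: self.approved)
            return f"{request_id} approved"
    ```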

  20. Temporal makes this easy and works great for such use cases. It's what I'm using for my own AI agents.
  21. What does Jetstream lack wrt queues/persistence?
  22. That was my initial position too, but I think there is a search-efficiency story here as well. CoT comes in many flavors and improves when tailored to the problem domain. If the LLM can instead figure out the right problem-solving strategy for a given problem on its own, this may improve performance per unit of compute vs. discovering the strategy at inference time.

    Tailoring prompts is likely still the best way to maximize performance when you can, but in broader domains you'd work around this through strategies like asking the LLM to combine predefined reasoning modules, creating multiple reasoning chains and merging/comparing them, explicit MCTS, etc. I think those strategies will still be useful for a good while, but pieces of that search process, especially directing the search more efficiently, will move into the LLMs over time as they get trained on this kind of data.
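
    The "multiple reasoning chains" strategy, for example, can be as simple as self-consistency-style sampling plus a vote (a sketch with a stubbed model call; swap in a real sampler):

    ```python
    from collections import Counter

    def llm(prompt: str, temperature: float = 0.8) -> str:
        # Stub: sample one reasoning chain ending in "ANSWER: <x>".
        return "...reasoning...\nANSWER: 42"

    def extract_answer(chain: str) -> str:
        return chain.rsplit("ANSWER:", 1)[-1].strip()

    def self_consistency(question: str, n: int = 5) -> str:
        # Sample several independent chains, then let them vote.
        answers = [extract_answer(llm(question)) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]

    print(self_consistency("What is 6 * 7?"))
    ```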

  23. It's due to how the RLHF and instruction tuning were done. IIRC, even the built-in system prompt works this way in ChatGPT.
  24. How much extra state and computation is it per token exactly? Can we account for the improvement in just those terms?
  25. I've only read the abstract, but also find this strange. I wonder if this is just tapping into the computational chains that are already available when tokens are further away, due to the positional encodings being trained that way. If so, that makes the reasoning/modeling powers of LLMs even more impressive and inscrutable.
  26. I've used usearch successfully for a small project: https://github.com/unum-cloud/usearch/
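
    Basic usage from Python looks roughly like this (dimensions and data made up):

    ```python
    import numpy as np
    from usearch.index import Index

    # 256-dim cosine-similarity index; keys are arbitrary integer ids.
    index = Index(ndim=256, metric="cos")

    vectors = np.random.rand(100, 256).astype(np.float32)
    index.add(np.arange(100), vectors)      # batch insert

    matches = index.search(vectors[0], 10)  # top-10 nearest neighbors
    print(matches.keys, matches.distances)
    ```
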
  27. What kind of best case are you imagining? I don't quite understand why the very best case would be dystopian.
  28. Has the title of the paper changed from what it was initially? It says "Have we built machines that think like people?" now, whereas the HN title is "Large language models lack deep insights or a theory of mind".
  29. If it works, and it's a one-off script, why do I care?
