agucova.dev - email: hn@agucova.dev
- FWIW, I work on AI and I also trust Pangram quite a lot (though exclusively on long-form text spanning at least four paragraphs). I'm pretty sure the book is heavily AI-written.
- How long were the extracts you gave to Pangram? Pangram only achieves its stated very high accuracy on long-form text spanning at least a handful of paragraphs. When I ran this book through it, I used an entire chapter.
- I ran the introduction chapter through Pangram [1], one of the most reliable AI-generated text classifiers out there [2] (with a benchmarked accuracy of 99.85% on long-form text), and it reports high confidence that the chapter was AI-generated. It's also intuitively obvious if you've spent a lot of time with LLMs.
I have no problem at all reading AI-generated content if it's good, but I don't appreciate dishonesty.
[1]: https://www.pangram.com/ [2]: https://arxiv.org/pdf/2402.14873
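A minimal sketch of that length caveat in practice, in case anyone wants to replicate this: only hand the detector long-form excerpts. The classify_with_pangram function below is a placeholder, not Pangram's actual API (check their docs for the real interface).

```python
MIN_PARAGRAPHS = 4  # rough threshold; the stated accuracy assumes long-form text


def classify_with_pangram(text: str) -> float:
    """Placeholder for a call to an AI-text detector; returns P(AI-generated)."""
    raise NotImplementedError("swap in a real API call here")


def check_excerpt(excerpt: str) -> float | None:
    paragraphs = [p for p in excerpt.split("\n\n") if p.strip()]
    if len(paragraphs) < MIN_PARAGRAPHS:
        return None  # too short to trust the classifier's verdict
    return classify_with_pangram(excerpt)
```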
- This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.
- For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve 2% of problems in the set (which is kept private to prevent contamination).
They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields Medalists and IMO question writers):
> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
- I’m guessing he’s probably talking about LessWrong, which nowadays also hosts a ton of serious safety research (and is often dismissed offhandedly because of its reputation as an insular internet community).
- I mean, this is how the Reflection model works. It just hides that from you behind an interface.
- You can use Dagitty.jl: https://docs.juliahub.com/Dagitty/kxRMH/0.0.1/
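For a flavor of the kind of query these tools answer, here's the same idea with networkx (>= 2.8) standing in for Dagitty.jl, whose exact API I won't vouch for from memory: build the causal DAG and test d-separation under a conditioning set.

```python
import networkx as nx

# A simple mediation DAG: X -> M -> Y
g = nx.DiGraph([("X", "M"), ("M", "Y")])

# X and Y are dependent unconditionally...
print(nx.d_separated(g, {"X"}, {"Y"}, set()))   # False
# ...but independent once we condition on the mediator M.
print(nx.d_separated(g, {"X"}, {"Y"}, {"M"}))   # True
```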
- I agree. I’m really more concerned about bioweapons, for which it’s generally understood (in security studies) that access to technical expertise is the limiting factor for terrorists. See Al Qaeda’s attempts to develop bioweapons in 2001.
- I imagine you meant societal harms? I think this was mostly my fault. I edited the areas of work a bit to better reflect what the UK AISI is actually working on right now.
- I recommend checking out the UK AISI's work on this: - https://www.gov.uk/government/publications/ai-safety-institu... - https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-upd...
- > A government agency determining limits on, say, heavy metals in drinking water is materially different than the government making declarations of what ideas are safe and which are not
Access to evaluate the models basically means the US government gets to know what these models are capable of, but the US AISI has essentially no enforcement power to dictate anything to anyone.
This is just a wild exaggeration of what's happening here.
- > Because lobbying exists in this country, and because legislators receive financial support from corporations like OpenAI, any so-called concession by a major US-based company to the US Government is likely a deal that will only benefit the company.
Sometimes both benefit? OAI and Anthropic gain from building trust with government entities early on (and perhaps from setting a precedent of self-regulation over federal regulation), while the US government gets to actually understand what these models are capable of and to have competent people inside government tracking AI progress and its potential downstream risks.
- > My issue with AI safety is that it's an overloaded term. It could mean anything from an llm giving you instructions on how to make an atomic bomb to writing spicy jokes if you prompt it to do so. it's not clear which safety these regulatory agencies would be solving for.
I think if you look at the background of the people leading evaluations at the US AISI [1], as well as the existing work on evaluations by the UK AISI [2] and METR [3], you will notice that it's much more the former than the latter.
[1]: https://www.nist.gov/people/paul-christiano [2]: https://www.gov.uk/government/publications/ai-safety-institu... [3]: https://arxiv.org/abs/2312.11671
- > What exactly does the evaluation entail?
I believe the US AISI has published less on their specific approach, but they’re largely expected to follow the general approach implemented by the UK AISI [1] and METR [2].
This is mostly focused on evaluating models on potentially dangerous capabilities. Some major areas of work include:
- Misuse risks: For example, determining whether models have (dual-use) expert-level knowledge in biology and chemistry, or the capacity to substantially facilitate large-scale cyber attacks. Good examples are Soice et al.'s work on bioweapon uplift [5] and Meta's CYBERSECEVAL [6], respectively.
- Autonomy: Whether models are capable of agent-like behavior of the kind that would be hard for humans to control. A big sub-area is Autonomous Replication and Adaptation (ARA), e.g. the ability of a model to escape simulated environments and exfiltrate its own weights. A good example is METR's original set of evaluations on ARA capabilities [3].
- Safeguards: How vulnerable these models are to, say, prompt injection attacks or jailbreaks, especially if they also possess dangerous capabilities like the ones above. A good example here is the UK AISI's work developing in-house attacks on frontier LLMs [4].
Labs like OAI, Anthropic and GDM already run these evaluations internally, as they're part of their respective responsible scaling policies, which determine which safety measures must be in place at each capability level of their models. (A toy sketch of what such an eval harness might look like follows the references below.)
[1]: https://www.gov.uk/government/publications/ai-safety-institu... [2]: https://metr.org/ [3]: https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks.... [4]: https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-upd... [5]: https://arxiv.org/abs/2306.03809 [6]: https://ai.meta.com/research/publications/cyberseceval-3-adv...
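To make the shape of these evaluations concrete, here's a toy sketch of what a capability-eval harness might look like. Every name in it (the task fields, query_model, the threshold) is hypothetical, not any lab's or AISI's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    category: str                    # e.g. "misuse", "autonomy", "safeguards"
    prompt: str
    scorer: Callable[[str], float]   # maps a model response to a score in [0, 1]


def query_model(prompt: str) -> str:
    """Placeholder for the API call to the model under evaluation."""
    raise NotImplementedError


def run_suite(tasks: list[EvalTask], threshold: float = 0.5) -> dict[str, float]:
    """Average scores per category and flag anything above the risk threshold."""
    per_category: dict[str, list[float]] = {}
    for task in tasks:
        score = task.scorer(query_model(task.prompt))
        per_category.setdefault(task.category, []).append(score)
    means = {cat: sum(s) / len(s) for cat, s in per_category.items()}
    for cat, mean in means.items():
        if mean >= threshold:
            print(f"flag: {cat} capability above threshold ({mean:.2f})")
    return means
```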
- > If training data contains multiple conflicting perspectives on a topic, the LLM has a limited ability to recognize that a disagreement is present and what types of entities are more likely to adopt which side. That is what those studies are reflecting.
Again, we have empirical evidence to suggest otherwise. It's not that there's an oracle, but that the LLM does internally differentiate between facts it has stored as simple truth vs. misconceptions vs. fiction.
This becomes obvious when interacting with popular LLMs: they can produce decent essays explaining different perspectives on various issues, and it makes total sense that they can, because if you need to predict tokens from all over the internet, you'd better be able to take on different perspectives.
Hell, we can even intervene on these internal mechanisms to elicit true answers from a model, in contexts where you would otherwise expect the LLM to output a misconception. To quote a recent paper, "Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface" [1], and this matches the rest of the interpretability literature on the topic.
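By "intervene" I mean things like adding a learned "truth direction" to a layer's activations at inference time. A rough sketch of the mechanics, with a plain linear layer standing in for a transformer block and a random vector standing in for a trained probe direction:

```python
import torch

d_model = 512
block = torch.nn.Linear(d_model, d_model)   # stand-in for a transformer block
direction = torch.randn(d_model)
direction = direction / direction.norm()    # in practice: learned from a truth probe
alpha = 5.0                                 # intervention strength


def steer(module, inputs, output):
    # Shift the block's output along the probe direction at inference time.
    return output + alpha * direction


handle = block.register_forward_hook(steer)
hidden = torch.randn(1, d_model)
steered = block(hidden)                     # now includes the shift
handle.remove()
```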
- I messed up the second reference, it should be https://arxiv.org/abs/2212.03827
- This isn't really true. LLMs do discriminate actual truth (though perhaps not perfectly). Other similar studies suggest they can differentiate, say, between commonly held misconceptions and scientific facts, even while repeating the misconception in context. This suggests models are at least sometimes aware when they're bullshitting or spreading a misconception, even if they're not communicating it.
This makes sense: you would expect LLMs to perform better when they can differentiate falsehoods from truths, since that's necessary for some contextual prediction tasks (say, predicting Snopes.com, or predicting what a domain expert would say about topic X).
- > Now, we can see from this description that nothing about the modeling ensures that the outputs accurately depict anything in the world. There is not much reason to think that the outputs are connected to any sort of internal representation at all.
This is just wrong. Accurate modelling of language at the scale of modern LLMs requires these models to develop rich world models during pretraining, which also requires distinguishing facts from fiction. This is why bullshitting happens less with better, bigger models: the simple answer is that they just know more about the world, and can also fill in the gaps more efficiently.
We have empirical evidence here: it's even possible to peek into a model to check whether the model 'thinks' what it's saying is true or not. From “Discovering Latent Knowledge in Language Models Without Supervision” (2022) [1]:
> Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. (...) We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
So when a model is asked to generate an answer it knows is incorrect, its internal state still tracks the truth value of the statements. This doesn't mean the model can't be wrong about what it thinks is true (or that it won't fill in gaps incorrectly, essentially bullshitting), but it does mean its world model is sensitive to truth.
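The core of that method (CCS) is small enough to sketch: train a probe on unlabeled activations so that a statement and its negation get complementary probabilities, while penalizing the degenerate p = 0.5 answer. Random tensors stand in for real activations below, and I'm skipping the normalization step the paper uses:

```python
import torch


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1 - p_neg)) ** 2       # statement and negation should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage the degenerate p = 0.5 solution
    return (consistency + confidence).mean()


d_model = 512
probe = torch.nn.Sequential(torch.nn.Linear(d_model, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

acts_pos = torch.randn(256, d_model)  # placeholder activations for "X? Yes"-style prompts
acts_neg = torch.randn(256, d_model)  # placeholder activations for "X? No"-style prompts

for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
    loss.backward()
    opt.step()
```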
More broadly, we do know these models have rich internal representations, and we have started learning how to read them. See for example “Language Models Represent Space and Time” (Gurnee & Tegmark, 2023) [2]:
> We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. While further investigation is needed, our results suggest modern LLMs learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model.
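Results like this typically come from simple linear probes: regress hidden activations onto real-world coordinates and look at held-out R^2. A sketch with placeholder data (on random inputs the score is near zero; the interesting finding is that it's high on real activations):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))   # placeholder: per-entity hidden states
coords = rng.normal(size=(1000, 2))   # placeholder: e.g. (latitude, longitude)

X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2:", probe.score(X_te, y_te))
```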
For anyone curious, I can recommend the Othello-GPT paper as a good introduction to this problem (“Do Large Language Models learn world models or just surface statistics?”) [3].
[1]: https://arxiv.org/abs/2212.03827 [2]: https://arxiv.org/abs/2310.02207 [3]: https://thegradient.pub/othello