- > 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.
Looking at the benchmark, https://www.swebench.com/, about half of the scored submissions score under 1/3 correct. So they're either not cheating, or not cheating effectively?
- > Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
I had a very similar impression (wrote more in https://hua.substack.com/p/are-longer-context-windows-all-yo...).
One framing is that effective context window (i.e. the length the model is able to effectively reason over) determines how useful the model is. A human new-grad programmer might effectively reason over 100s or 1000s of tokens but not millions - which is why we carefully scope their work and explain exactly where to look for relevant context. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
- 1 point
- from an AI research perspective -- it's pretty straightforward to mitigate this attack
1. perplexity filtering - a small LM scores how in-distribution the data is relative to the LLM's distribution. if perplexity is too high (gibberish like this) or too low (likely already LLM-generated at low temperature, or already memorized), toss it out.
2. models can learn to prioritize/deprioritize data just based on the domain name of where it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
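a toy sketch of the first idea (a unigram character model standing in for the small LM, with made-up thresholds - real pipelines use an actual LM and tuned cutoffs):

```python
import math
from collections import Counter

def char_perplexity(text, counts, vocab_size=256):
    # per-character perplexity under a unigram model with add-one smoothing
    total = sum(counts.values())
    log_prob = 0.0
    for ch in text:
        p = (counts.get(ch, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))

def keep_for_training(text, counts, low=2.0, high=50.0):
    # toss documents whose perplexity is implausibly low (memorized /
    # low-temperature LLM output) or implausibly high (gibberish)
    ppl = char_perplexity(text, counts)
    return low <= ppl <= high

# "reference distribution" estimated from a small clean corpus
corpus = "the quick brown fox jumps over the lazy dog " * 50
counts = Counter(corpus)

print(keep_for_training("the dog jumps over the fox", counts))  # True (kept)
print(keep_for_training("xq#7zv!!pk@@mm^^ssj", counts))         # False (filtered)
```

same shape as the real thing: score every candidate document against a reference model, keep only the middle of the distribution.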
- nice work! I wrote a similar library (https://github.com/stillmatic/gollum/blob/main/packages/vect...) and similarly found that exact search (w/the same simple heap + SIMD optimizations) is quite fast. with 100k objects, retrieval queries complete in <200ms on an M1 Mac. no need for a fancy vector DB :)
that library used `viterin/vek` for SIMD math: https://github.com/viterin/vek/
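the heap-based exact search is simple enough to sketch in a few lines (pure Python, no SIMD, and not the gollum API - just the idea of a bounded min-heap over brute-force dot products):

```python
import heapq
import random

def top_k_exact(query, corpus, k=5):
    # exact nearest-neighbor search: brute-force dot products, keeping
    # only the k best scores seen so far in a size-k min-heap
    heap = []  # min-heap of (score, index), size <= k
    for i, vec in enumerate(corpus):
        score = sum(q * v for q, v in zip(query, vec))  # dot product
        if len(heap) < k:
            heapq.heappush(heap, (score, i))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, i))  # evict current worst
    return sorted(heap, reverse=True)  # best-first (score, index) pairs

random.seed(0)
dim = 64
corpus = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2_000)]
query = corpus[42]  # the query should be its own nearest neighbor
results = top_k_exact(query, corpus, k=3)
print(results[0][1])  # 42
```

the heap keeps memory at O(k) instead of sorting all N scores; the SIMD part in the real libraries just replaces the inner dot-product loop.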
- reminds me a lot of rmarkdown - which allows you to run many languages in a similar fashion https://rmarkdown.rstudio.com/
- in particular it appears that they only implement data parallelism (DP) - at 1.2B you can fit a full copy of the model into memory, but larger models require splitting the weights across multiple machines (different techniques, e.g. distributed data parallel (DDP), tensor parallel (TP), pipeline parallel (PP), ...)
without more details it's unclear if the proposed technique keeps its speedups in that case
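for reference, plain DP in one picture (a toy scalar model, not the paper's implementation): every worker holds a full copy of the weights, computes gradients on its own data shard, and the averaged gradient is applied identically everywhere:

```python
def grad(w, batch):
    # gradient of mean squared error for a 1-D linear model y = w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, shards, lr=0.01):
    # each "worker" sees only its own shard; all hold the same weight w
    local_grads = [grad(w, shard) for shard in shards]
    # all-reduce: average gradients, then apply the same update everywhere
    g = sum(local_grads) / len(local_grads)
    return w - lr * g

# synthetic data for y = 3x, split across 4 workers
data = [(x, 3 * x) for x in range(-8, 8)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # converges to 3.0
```

the splitting techniques (DDP/TP/PP) change what gets communicated - gradients vs activations vs layer boundaries - which is why a speedup in the DP regime doesn't automatically carry over.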
- > Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
We are also not exactly looking letter by letter at everything we read.
- I gave the same article to Claude 3.5 Sonnet and the result seems reasonably similar to the author's handwritten summary.
```
This article examines the governance of Dutch pension funds in light of the Future of Pensions Act (Wtp). The new legislation shifts towards more complete pension contracts and emphasizes operational execution, necessitating changes in pension fund governance. The authors propose strengthening pension funds' internal organization, improving accountability to participants, and enhancing the powers of participant representation bodies.

Key recommendations include establishing a recognizable governance structure with clear responsibilities, creating a College of Stakeholders (CvB) to replace existing accountability bodies, and granting the CvB more authority, including appointment and dismissal powers. The proposals aim to balance the interests of social partners, pension funds, and participants while ensuring transparency and effective oversight.

The article emphasizes principles such as transparency, trust, loyalty, and prudence in shaping governance reforms. It also discusses the impact of digitalization (DORA), the need for pension funds to demonstrate value, and the potential for further consolidation in the sector. International perspectives, including insights from the World Bank, inform the proposed governance improvements. These changes are designed to help pension funds adapt to the new system, manage risks effectively, and maintain their "license to operate" in a changing landscape.
```
Similarly, the second article's summary also captures the key points that the author points out (emphasis mine).
```
The article "Regulating pensions: Why the European Union matters" explores the growing influence of EU law on pension regulation. While Member States retain primary responsibility for pension provision, the authors argue that EU law significantly impacts national pension systems through both direct and indirect means.

The paper begins by examining the EU's institutional framework regarding pensions, focusing on the principles of subsidiarity and the division of powers between the EU and Member States. It emphasizes that the EU can regulate pension matters when the Internal Market's functioning is at stake, despite lacking specific regulatory competencies for pensions. The authors note that the subsidiarity principle has not proven to be an obstacle for EU action in this area.

The article then delves into EU substantive law and its impact on pensions, concentrating on the concept of Services of General Economic Interest (SGEI) and its role in classifying pension fund activities as economic or non-economic. The authors discuss the case law of the Court of Justice of the European Union (CJEU), highlighting its importance in determining when pension schemes fall within the scope of EU competition law. They emphasize that the CJEU's approach is based on the degree of solidarity in the scheme and the extent of state control.

** The paper examines the IORP Directive, outlining its current scope and limitations. The authors argue that the directive is unclear and leads to distortions in the internal market, particularly regarding the treatment of pay-as-you-go schemes and book reserves. They propose a new regulatory framework that distinguishes between economic and non-economic pension activities. For non-economic activities, the authors suggest a soft law approach using a non-binding code or communication from the European Commission. This would outline the basic features of pension schemes based on solidarity and the conditions for exemption from EU competition rules. For economic activities, they propose a hard law approach following the Lamfalussy technique, which would provide detailed regulations similar to the Solvency II regime but tailored to the specifics of IORPs (Institutions for Occupational Retirement Provision). **

The authors conclude that it's impossible to categorically state whether pensions are a national or EU competence, as decisions must be made on a case-by-case basis. They emphasize the importance of considering EU law when drafting national pension legislation and highlight the need for clarity in the division of powers between the EU and Member States regarding pensions. Overall, the paper underscores the complex interplay between EU law and national pension systems, calling for a more nuanced understanding of the EU's role in pension regulation and a clearer regulatory framework that respects both EU and national competencies.
```
I'd bet that the author used GPT-3.5-turbo (aka the free version of ChatGPT) and did not give any particular prompting help. To create these, I asked Claude to create a prompt for summarization with chain-of-thought revision, used that prompt, and returned the result. Better models with a little bit more inference-time compute go a long way.
- except to the extent that your voice may be part of your image, which is actionable: https://en.wikipedia.org/wiki/Midler_v._Ford_Motor_Co.
- The samples were released a while back: https://google-research.github.io/seanet/stream_vc/
- Last week, a vision-language model made the rounds on Twitter and Hacker News (https://www.hackerneue.com/item?id=40505099, made the front page). However, the model code and weights were copied from another team's work, MiniCPM-V, without attribution. The original authors have removed their repo and HuggingFace model (https://x.com/var_epsilon/status/1797628346156945459) and the MiniCPM authors have presented very damning evidence (https://x.com/zhanga6/status/1797293189378068768). Notably, the model was originally trained on held-out examples of ancient Chinese script, which their university recently scanned, and the two models perform identically, which should not be possible given the uniqueness of the data. I consider this watermarking to be pretty clever!
There's plenty of paper plagiarism, but this is the first case I've seen of model plagiarism.
- 21 points
- it's probably correct to think of functionally all ML models as being stateless. even something like twitter/fb feed - the models themselves remain the same (usually updated 1-2x per month IIRC) - only the data and the systems change.
an illustrative example: say you open twitter, load some posts, then refresh. the model's view of you is basically the same, and even the data is basically the same. you get different posts, however, because there is a system (read: bloom filter) on top of the model that chooses which posts go into ranking, and that system removes posts that you've already seen. similarly, if you view some posts or like them, that updates a signal (e.g. time on user X's profile) but not the actual model.
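that "system on top" can be sketched in a few lines (a toy bloom filter and made-up scores, not any real feed's code - the point is that the scoring function stays frozen while the filter carries the state):

```python
import hashlib

class BloomFilter:
    # minimal bloom filter: k hashes over a fixed-size bit array
    def __init__(self, size=1024, k=3):
        self.size, self.k, self.bits = size, k, bytearray(size)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def rank(candidates, seen, score_fn):
    # the "model" (score_fn) is frozen; freshness comes from the seen-filter
    return sorted((p for p in candidates if p not in seen),
                  key=score_fn, reverse=True)

scores = {"post_a": 0.9, "post_b": 0.7, "post_c": 0.5, "post_d": 0.3}
seen = BloomFilter()

first = rank(scores, seen, scores.get)
for p in first[:2]:
    seen.add(p)  # user scrolled past the top two posts

second = rank(scores, seen, scores.get)
print(second)  # ['post_c', 'post_d'] - seen posts filtered before ranking
```

refreshing re-runs the same frozen scorer; only the filter's state changed.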
what's weird about LLMs is that they're modeling the entire universe of written language, which does not actually change that frequently! now, it is completely reasonable to instead consider the problem to be modeling 'a given user's preference for written language' - which is personalized and can change. but this is a different kind of feedback to gather and model toward. recall the ranking signals - most people don't 'like' posts even if they do like them, hence the reliance on implicit signals like 'time spent.'
one approach I've considered is using user feedback to steer different activation vectors towards user-preferred responses. that is much closer to the traditional ML paradigm - user feedback updates a signal, which is used at inference time to alter the output of a frozen model. this certainly feels doable (and honestly kinda fun) but challenging without tons of users and scale :)
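a toy version of that loop (made-up dimensions and a stand-in "model" - real steering would operate on an actual transformer's hidden states):

```python
def frozen_model(prompt):
    # stand-in for a frozen LLM's hidden activation for a prompt
    return [0.2, -0.1, 0.4]

def steer(activation, user_vector, alpha=1.0):
    # inference-time intervention: add the user's preference direction
    # to the frozen model's activation before decoding
    return [a + alpha * u for a, u in zip(activation, user_vector)]

def update_user_vector(user_vector, preferred, rejected, lr=0.1):
    # classic ML-style feedback loop: nudge the stored vector toward
    # activations of responses the user liked, away from rejected ones
    return [u + lr * (p - r) for u, p, r in zip(user_vector, preferred, rejected)]

user_vec = [0.0, 0.0, 0.0]
preferred = [1.0, 0.0, 0.0]  # activation of a response the user upvoted
rejected = [0.0, 1.0, 0.0]   # activation of one they skipped
for _ in range(5):
    user_vec = update_user_vector(user_vec, preferred, rejected)

steered = steer(frozen_model("hi"), user_vec)
print([round(x, 2) for x in steered])  # [0.7, -0.6, 0.4]
```

the model weights never change - user feedback updates a small per-user vector that perturbs the frozen model's activations at inference time.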
- the biggest difference is that existing multimodal models (eg GPT-4V and MM1) trained the text model first, and then added in the image component after text training was done ('late fusion'). MM1 learns a projection into the text space, not discrete tokens, and thus cannot generate images.
Other work lets the model learn the 'tokenization' more explicitly during training. That's more similar to Adept's Fuyu architecture, which I am personally a fan of, but it also does not enable generating images out.
You can generate images using late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.
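to make the "projection, not discrete tokens" point concrete, here's a toy sketch of late fusion (made-up dimensions, a random matrix standing in for a learned projection):

```python
import random

random.seed(1)
IMG_DIM, TXT_DIM = 8, 4

# learned projection matrix (random here): maps image-encoder features
# into the text model's embedding space - continuous vectors, not
# discrete tokens, which is why this setup can't generate images back
W = [[random.gauss(0, 0.1) for _ in range(IMG_DIM)] for _ in range(TXT_DIM)]

def project(img_features):
    # matrix-vector product: one projected coordinate per row of W
    return [sum(w * f for w, f in zip(row, img_features)) for row in W]

img_features = [random.gauss(0, 1) for _ in range(IMG_DIM)]
soft_tokens = project(img_features)      # a "soft prompt" for the text model
text_tokens = [[0.1] * TXT_DIM, [0.2] * TXT_DIM]
sequence = [soft_tokens] + text_tokens   # prepend image slot to text sequence
print(len(sequence), len(sequence[0]))   # 3 4
```

early fusion instead gives images their own discrete vocabulary entries up front, so the model can emit them just like text tokens.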
- related previous discussion https://www.hackerneue.com/item?id=39993626
anyways, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not