Website: http://izbicki.me
- What's the origin of XXX? I've seen FIXME/NOTE/TODO all over the place, but never encountered XXX before.
- Except those papers are 8ish years old; they actually were among the first 2-3 algs for this task; and they studied the fully general vector space alignment problem. But I agree that naming things is hard and don't have a better name.
- > We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables.
Sorry if I'm being obtuse, but I don't see any mention of the POT package in your paper or of what specific algorithms you used from it to compare against. My best guess is that you used the linear map similar to the example at <https://pythonot.github.io/auto_examples/domain-adaptation/p...>. The methods I mentioned are also linear, but contain a number of additional tricks that result in much better performance than a standard L2 loss, and so I would expect those methods to outperform your OT baseline.
> As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
But both of those papers are about generic vector alignment, so the generality of the name makes sense. Your contribution here seems specifically about the LLM use case, and so a name that implies the LLM use case would be preferable.
I do agree though that in general naming is hard and I don't have a better name to suggest. I also agree that there's lots of related papers, and you can't cite/discuss them all reasonably.
And I don't mean to be overly critical... the application to LLMs is definitely cool. I wouldn't have read the paper and written up my critiques if I didn't overall like it :)
- I hate to be "reviewer 2", but:
I used to work on what your paper calls "unsupervised transport", that is machine translation between two languages without alignment data. You note that this field has existed since ~2016 and you provide a number of references, but you only dedicate ~4 lines of text to this branch of research. There's no discussion of how your technique differs from this prior work or why the prior algorithms can't be applied to the output of modern LLMs.
Naively, I would expect off-the-shelf embedding alignment algorithms (like <https://github.com/artetxem/vecmap> and <https://github.com/facebookresearch/fastText/tree/main/align...>, neither of which are cited or compared against) to work quite well on this problem. So I'm curious if they don't or why they don't.
I can imagine there is lots of room for improvements around implicit regularization in the algorithms. Specifically, these algorithms were designed with word2vec output in mind (typically 300 dimensional vectors with 200000 observations), but your problem has higher dimensional vectors with fewer observations and so would likely require different hyperparameter tuning. IIRC, there's no explicit regularization in these methods, but hyperparameters like stepsize/stepcount can implicitly add L2 regularization, which you probably need for your application.
---
PS.
I *strongly dislike* your name of vec2vec. Yours isn't the first/only algorithm for taking vectors as input and getting vectors as output, and you have no right to claim such a general title.
---
PPS.
I believe there is a minor typo with footnote 1. The note is "Our code is available on GitHub." but it is attached to the sentence "In practice, it is unrealistic to expect that such a database be available."
- It seems like you have some misconceptions about Strassen's alg:
1. It is a standard example of the divide and conquer approach to algorithm design, not the dynamic programming approach. (I'm not even sure how you'd squint at it to convert it into a dynamic programming problem.)
2. Strassen's does not require complex valued matrices. Everything can be done in the real numbers.
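To make both points concrete, here's a minimal sketch of Strassen's in plain Python. It's my own illustration, not a tuned implementation: it assumes square matrices whose size is a power of 2, and the helper names (`add`, `sub`, `split`, `naive_matmul`) are made up for this example. Note that it recurses on 7 half-sized products (divide and conquer, no table of subproblems) and only ever uses ordinary addition, subtraction, and multiplication on the entries.

```python
def add(A, B):
    """Entrywise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Entrywise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(A):
    """Split a 2k x 2k matrix into its four k x k quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def naive_matmul(A, B):
    """Textbook O(n^3) product, used only as a reference for checking."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def strassen(A, B):
    """Multiply two n x n matrices, n a power of 2, with 7 recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # Divide: seven half-sized products instead of the naive eight.
    m1 = strassen(add(a11, a22), add(b11, b22))
    m2 = strassen(add(a21, a22), b11)
    m3 = strassen(a11, sub(b12, b22))
    m4 = strassen(a22, sub(b21, b11))
    m5 = strassen(add(a11, a12), b22)
    m6 = strassen(sub(a21, a11), add(b11, b12))
    m7 = strassen(sub(a12, a22), add(b21, b22))
    # Conquer: combine the products into the quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Nothing here needs complex numbers; integer or float entries work the same way.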
- As a CS prof, I'd love to have this in my office for students to play with. Looks awesome!
- That all makes sense to me. But it definitely won't make sense to my intro to programming students. They already have enough weird syntax to juggle.
- Building off this question, it's not clear to me why Python should have both t-strings and f-strings. The difference between the two seems like a stumbling block to new programmers, and my "ideal python" would have only one of these mechanisms.
- > I've seen a lot of examples that use CSS to show the prompt or line number without it becoming part of copied text, and I'm highly in favor of that.
This is unfortunately not compatible with writing the tutorial in markdown to be rendered on github.
- I have a minor nit to pick. I actually prefer when tutorials provide the prompts for all code snippets for two reasons:
1. Many tutorials reference many languages. (I frequently write tutorials for students that include bash, sql, and python.) Providing the prompts `$`, `sqlite>` and `>>>` makes it obvious which language a piece of code is being written in.
2. Certain types of code should not be thoughtlessly copy/pasted, and providing multiline `$` prompts enforces that the user copy/pastes line by line. A good example is a sequence of commands that involves `sudo dd` to format a hard drive. But for really intro-level stuff I want the student/reader to carefully think about all the commands, and forcing them to copy/paste line by line helps achieve that goal.
That said, this is an overall good introduction to writing that I will definitely be making required reading for some of my data science students. When the book is complete, I'll be happily buying a copy :)
- > I can talk about concepts like "atoms" or "bacteria" or "black holes" with anyone, and they'll know what they are - even if their knowledge of those subjects isn't in depth.
I'm not convinced this is an unalloyed good. Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria are. See, for example, all of the people who want antibiotics whenever they're sick for any reason. We've just replaced one set of weird beliefs in the general populace with another and given it a veneer of science.
- I think you're wrong.
Suicide does not have stable reporting rates. It was very stigmatized in the past, and so investigators would notoriously report suicides as "unknown cause of death" if they could.
Violent crime, on the other hand, is much more correlated with things like poverty than with mental health.
I think it's quite obviously the case that there are no clear indicators about what "mental health" looked like 100 years ago. Any projections into the past will involve a lot of extrapolation and have all sorts of biases.
- They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:
> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).
It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task, even if there are "better" ways of solving the Harry Potter problem.
- The WHO list of essential medicines is not just over-the-counter drugs. It includes things like the chemotherapy drug cisplatin. I happened to need that for testicular cancer ~10 years ago, and the treatment cost was $50k (as "paid" by insurance). That overall seems pretty reasonable to me for the treatment I received, but definitely not something I'd expect the median American to be able to pay out of pocket.
- Your wording in this comment (and the twitter/comment video) gives off the same vibes as the Google April 1st videos for things like Gmail Motion (https://www.youtube.com/playlist?list=PLAD8wFTLnQKeDsINWn8Wj...). I honestly thought this was full sarcasm at first.
- I don't see how that would have helped in this case. This was not a resource at a known location that was supposed to be only available to logged in users. This was a resource that the admins didn't know about available at an unknown url that was exposed to the public internet due to a configuration error. Are you going to write a test case for every possible url in your server to make sure it's not being exposed?
Something that could work is including a random hash as a first hidden email inside of every client, and then regularly searching outbound traffic for that hash. But that would be rather expensive.
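A rough sketch of that idea, just to show the shape of it. Everything here is hypothetical: the function names, the per-client dictionary, and the assumption that you can get at outbound payloads as raw bytes. It also ignores the expensive part (actually capturing and scanning all outbound traffic, and handling compression or encryption).

```python
import secrets


def make_canary():
    """Generate a random token to plant in a hidden email for one client."""
    return secrets.token_hex(16)


def leaked_canaries(payload, canaries):
    """Return the clients whose planted token appears in an outbound payload.

    `payload` is raw bytes; `canaries` maps a client name to its token.
    """
    return [client for client, token in canaries.items()
            if token.encode() in payload]


# Plant one token per (hypothetical) client mailbox.
canaries = {"client-a": make_canary(), "client-b": make_canary()}

# Simulated outbound response that accidentally includes client-b's hidden email.
outbound = b"HTTP/1.1 200 OK\r\n\r\nSubject: canary " + canaries["client-b"].encode()

print(leaked_canaries(outbound, canaries))  # ['client-b']
```

The nice property is that you don't need to know which url is leaking, only that the token showed up somewhere it shouldn't.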
- There was rather a lot of NATO coordination in the US-led invasions of both Iraq and Afghanistan. None of the military missions in these countries were in response to the Article V mutual defense clause of the NATO treaty. It's very easy to see how these operations (and therefore the NATO alliance) would be seen as aggressive to these countries.
- This is false. Standard sampling algorithms like beam search can "backtrack" and are widely used in generative language models.
It is true that the runtime of these algorithms is exponential in the length of the sequence, and so lots of heuristics are used to reduce this runtime in practice, and this limits the "backtracking" ability. But this limitation is purely for computational convenience's sake and not something inherent in the model.
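Here's a toy sketch of what I mean by "backtracking". The `TOY_MODEL` table is a made-up stand-in for a language model's next-token probabilities (it only covers sequences of length 2), and `beam_search` is a bare-bones version without any of the heuristics used in practice. With a beam width of 1 the search commits to the locally best first token; with a width of 2 it keeps an alternative prefix alive and ends up abandoning the initially preferred token.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, length, beam_width):
    """Return the best (sequence, log_prob) found with the given beam width."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the `beam_width` highest-scoring prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_seq, greedy_score = beam_search(TOY_MODEL, length=2, beam_width=1)
beam_seq, beam_score = beam_search(TOY_MODEL, length=2, beam_width=2)

# Greedy starts with "a" (p = 0.6) and can do no better than 0.6 * 0.5 = 0.30.
print(greedy_seq, round(math.exp(greedy_score), 2))
# Width 2 keeps "b" around and finds ("b", "x") with 0.4 * 0.9 = 0.36.
print(beam_seq, round(math.exp(beam_score), 2))
```

Making the beam wide enough to cover every prefix recovers exact search, which is where the exponential cost comes from.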
- One of the reasons I've intentionally decided not to become independently wealthy is that I want to have to explain to other people why I'm doing things. Part of my work is "charity-ish", and by not being able to do things on my own, I'm forced to improve my communication skills and involve other people in these charity activities. I think that ultimately improves the final outcome, even if the process is immensely more frustrating.
- I suspect the numbers would be worse if you looked at households instead of individuals due to declining marriage rates (but I'm not willing to put in the effort to find numbers).
- And the next step after this epiphany is that you still have to remember to take the phone with you places, not to leave it behind, and worry about it getting dropped in the toilet by a toddler. Not carrying this tool still has a lot of benefits.
- Do you happen to have a link to the proposal I can see and share with a class? I'm teaching a few lectures about some "weird" stuff this semester, and this would be a great example.
- > Then computer security. Unlike the internet or jet engines, these have not panned out as foundational research (except perhaps for some of HIV)
In what world is computer security not a foundational topic? There's lots of reasons to critique the way NSF/NIH/DOD/etc allocate funding, but this is definitely not one of them.
- These exercises are writing mathematical proofs that basic machine learning algorithms behave correctly. They are "pen and paper" not because you are manually solving a large equation that a machine would normally solve, but because we don't have automated theorem provers capable of proving interesting machine learning theorems. I would expect a typical 1st year grad student to be using a resource like this.
If you don't understand the purpose of proofs, then this resource is not aimed at you.
- I believe you are incorrect. According to wikipedia:
> The Lindy effect is a theorized phenomenon by which the future life expectancy of some non-perishable things, like a technology or an idea, is proportional to their current age.
This implies that things that have been around for a short period of time do in fact have a short expected lifespan. You're correct that "A implies B does not mean B implies A as well", but that assumption is not needed.
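A small worked example, under an assumption that goes beyond the Wikipedia quote: I'm modeling lifetimes with a Pareto distribution, which is one common way to get Lindy-style behavior. The function name and the `alpha` parameter are mine. Under that model the expected remaining life is directly proportional to current age, so a young thing gets a short expected future without needing any converse reasoning.

```python
def expected_remaining_life(age, alpha=2.0):
    """Expected remaining lifetime given survival to `age`.

    Assumes a Pareto survival function S(t) = (t_min / t) ** alpha with
    alpha > 1 and age >= t_min. Then

        E[T - age | T > age] = integral from age to infinity of (age / u) ** alpha du
                             = age / (alpha - 1),

    which is proportional to the current age.
    """
    if alpha <= 1:
        raise ValueError("alpha must be > 1 for the expectation to be finite")
    if age <= 0:
        raise ValueError("age must be positive")
    return age / (alpha - 1)


# With alpha = 2, something that has lasted t years is expected to last t more.
for age in (1, 10, 100):
    print(age, expected_remaining_life(age))
```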
- My work brings me into regular contact with DPRK IT professionals, for example by [teaching open source software](https://izbicki.me/blog/teaching-open-source-in-north-korea....) or [teaching proper web design](https://izbicki.me/blog/fixing-north-korea-kcna-webpage.html). I make a lot of effort to respect sanctions, but documents like this are incredibly unhelpful. I've read through the document, and it seems completely devoid of actionable, DPRK-specific information that can help IT professionals avoid sanctions violations. For example, the document encourages websites to monitor for the following activity as "indications of DPRK IT workers who may be using their platforms":
• Multiple logins into one account from various IP addresses in a relatively short period of time, especially if the IP addresses are associated with different countries;
• Developers are logging into multiple accounts on the same platform from one IP address;
• Developers are logged into their accounts continuously for one or more days at a time;
• Router port or other technical configurations associated with use of remote desktop sharing software, such as port 3389 in the router used to access the account, particularly if usage of remote desktop sharing software is not standard company practice;
• Developer accounts use a fraudulent client account to increase developer account ratings, but both the client and developer accounts use the same PayPal account to transfer/withdraw money (paying themselves with their own money);
• Frequent use of document templates for things such as bidding documents and project communication methods, especially the same templates being used across different developer accounts;
• Multiple developer accounts receiving high ratings from one client account in a short period, with similar or identical documentation used to establish the developer accounts and/or the client account;
• Extensive bidding on projects, and a low number of accepted project bids compared to the number of projects bids on by a developer; and
• Frequent transfers of money through payment platforms, especially to PRC-based bank accounts, and sometimes routed through one or more companies to disguise the ultimate destination of the funds.
This list is so generic that I'm not sure what the point of it is. I think it would make sense to ban some of these practices from a general security perspective. But these practices would give way too many false positives if you were trying to use them to identify DPRK developers.
I'm honestly really confused about who the target audience is for publications like this. It can't be actual IT professionals due to the lack of actionable information. Is it journalists? Do we publish these things just to remind them that we don't like the DPRK?
- Your second paragraph is implying that the half of Americans who voted for Trump are "bad Americans". That seems to be sowing the division that your first paragraph warns against (even if it is a reason to dislike Trump).
I don't think either democrats or republicans can claim the moral high ground about sowing division.