Founder, Hotpot.ai and HotpotBio (Hotpot.ai/bio)
Better data for lung cancer: https://hotpot.ai/non-smoking-lung-cancer (We don't benefit financially; we're only trying to plug holes in the system.)
Reevaluating Viral Etiology and Rational Diagnosis of ME/CFS and Low-grade Brain Parenchymal Inflammation: message for pre-publication access.
EBV + breast cancer: https://www.biorxiv.org/content/10.1101/2024.11.28.625954v2
Please message for free Hotpot credits. HN is an invaluable source of knowledge. I would be honored to give back.
- A more accurate title: "Are Cornell Students Meritocratic and Efficiency-Seeking? Evidence from 271 MBA Students and 67 Undergraduate Business Students."
This topic is important and the study interesting, but the methods exhibit the same generalizability bias as the famous Dunning-Kruger study.
The referenced MBA students, and by extension "the elites," comprise only 271 students across two years, all from the same university.
By analyzing biased samples, we risk misguided discourse on a sensitive subject.
@dang
- Thanks. This is helpful. Looking forward to more of your thoughts.
Some nuance:
What happens when the methods are outdated/biased? We highlight a potential case in breast cancer in one of our papers.
Worse, who decides?
To reiterate, this isn’t to discourage the idea. The idea is good and should be considered, but doesn’t escape (yet) the core issue of when something becomes a “fact.”
- Valid critique, but one addressing a problem above the ML layer, at the human layer. :)
That said, your comment raises a broader question: in which fields can we trust the data if incentives are poor?
For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?
These are hard questions.
ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.
Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."
- If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.
Every fact is born an opinion.
This challenge exists in most, if not all, spheres of life.
- 100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.
(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)
- To elaborate, errors go beyond data and reach into model design. Two simple examples:
1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical. Differences may alter gene expression -- akin to "polish" vs. "Polish".
2. Sickle-cell anemia and other diseases hinge on single nucleotide differences. These single nucleotide polymorphisms (SNPs) mean hard attention over DNA matters and single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry those assumptions into healthcare. (A toy sketch follows.)
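A toy sketch of point 2, using a made-up 15-base window around HBB codon 6 and a naive 5-mer chunker (illustrative only, not a real pipeline):

    # Why single-base resolution matters: the sickle-cell variant is one A->T change
    # in HBB (codon 6, GAG -> GTG, Glu -> Val). The window below is illustrative.
    from typing import List, Tuple

    def single_base_diffs(ref: str, alt: str) -> List[Tuple[int, str, str]]:
        """Return (position, ref_base, alt_base) for every mismatch between aligned sequences."""
        assert len(ref) == len(alt), "toy example assumes pre-aligned, equal-length sequences"
        return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]

    ref_window = "ACTCCTGAGGAGAAG"   # ...GAG... (glutamate at codon 6)
    alt_window = "ACTCCTGTGGAGAAG"   # ...GTG... (valine) -> the sickle-cell SNP

    print(single_base_diffs(ref_window, alt_window))   # [(7, 'A', 'T')]

    # A coarse non-overlapping 5-mer "tokenizer" hides the pathogenic change inside
    # a single token, so two of the three tokens look identical across the sequences.
    def kmer_tokens(seq: str, k: int = 5) -> List[str]:
        return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

    print(kmer_tokens(ref_window))   # ['ACTCC', 'TGAGG', 'AGAAG']
    print(kmer_tokens(alt_window))   # ['ACTCC', 'TGTGG', 'AGAAG']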
There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.
We need way more people thinking about biomedical AI.
- This is long overdue for biomedicine.
Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.
Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.
We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.
If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.
Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.
Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.
- The author is a respected voice in tech and a good proxy of investor mindset, but the LLM claims are wrong.
They are unsupported not only by recent research trends and general patterns in ML and computing, but also by emerging developments in China, which the post itself mentions.
Nonetheless, the post is thoughtful and helpful for calibrating investor sentiment.
- More like an alarming anecdote. :) Google did a wonderful job relabeling MedQA, a core benchmark, but even they missed some errors (e.g., question 448 in the test set remains wrong according to Stanford doctors).
For ML, start with MedGemma. It's a great family. 4B is tiny and easy to experiment with. Pick an area and try finetuning.
Note the new image encoder, MedSigLIP, which leverages another cool Google model, SigLIP. It's unclear if MedSigLIP is the right approach (open question!), but it's innovative and worth studying for newcomers. Follow Lucas Beyer, SigLIP's senior author and now at Meta. He'll drop tons of computer vision knowledge (and entertaining takes).
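If you want something runnable on day one, here is a minimal sketch. Assumptions to verify against the model card: the repo id google/medgemma-4b-it, the image-text-to-text pipeline task, a recent transformers release, and gated-model access (accept the license, then huggingface-cli login).

    # Hedged sketch: querying MedGemma 4B through the Hugging Face transformers pipeline.
    from transformers import pipeline

    pipe = pipeline(
        "image-text-to-text",            # MedGemma 4B-it is multimodal (MedSigLIP vision tower + Gemma 3 LM)
        model="google/medgemma-4b-it",   # assumed repo id; gated on Hugging Face
        device_map="auto",
    )

    # Text-only chat prompt; image inputs can be added to the content list.
    messages = [
        {"role": "user",
         "content": [{"type": "text",
                      "text": "A 6-year-old has a barking cough and inspiratory stridor. Most likely diagnosis?"}]},
    ]

    out = pipe(text=messages, max_new_tokens=128)
    print(out[0]["generated_text"][-1]["content"])

From there, parameter-efficient finetuning (e.g., LoRA via the peft library) on a small labeled set is a reasonable first project.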
For bio, read 10 papers in a domain of passion (e.g., lung cancer). If you (or AI) can't find one biased/outdated assumption or method, I'll gift a $20 Starbucks gift card. (Ping on Twitter.) This matters because data is downstream of study design, and of course models are downstream of data.
Starbucks offer open to up to three people.
- Thanks, but no one truly understands biomedicine, let alone biomedical ML.
Feynman's quote -- "A scientist is never certain" -- is apt for biomedical ML.
Context: imagine the human body as the most devilish operating system ever written, with 10B+ lines of code (more than merely genomics), tight coupling everywhere, and zero comments. Oh, and one faulty line may cause death.
Are you more interested in data, ML, or biology (e.g., predicting cancerous mutations or drug toxicology)?
Biomedical data underlies everything and may be the easiest starting point because it's so bad/limited.
We had to pay Stanford doctors to annotate QA questions because existing datasets were so unreliable. (MCQ dataset partially released, full release coming).
For ML, MedGemma from Google DeepMind is open and at the frontier.
Biology mostly requires publishing, but there are still ways to help.
Once you share your preferences, I can offer a more targeted path.
- Agreed. There is deep potential for ML in healthcare. We need more contributors advancing research in this space. One opportunity as people look around: many priors merit reconsideration.
For instance, genomic data that looks identical may not be. In classic biological representations (FASTA), canonical cytosine and methylated cytosine are both collapsed into the letter "C", even though the difference can drive differential gene expression.
What's the optimal tokenization algorithm and architecture for genomic models? How about protein binding prediction? Unclear!
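To make the cytosine point concrete, a toy encoding sketch (the "M" symbol for 5-methylcytosine is invented for illustration; it is not part of the FASTA standard):

    # Toy character-level DNA vocabularies: one collapses methylated cytosine into "C",
    # the other keeps it distinct, so the epigenetic signal survives tokenization.
    from typing import Dict, List

    CANONICAL: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
    EXTENDED: Dict[str, int] = {**CANONICAL, "M": 5}   # "M" = 5mC in this toy encoding only

    def encode(seq: str, vocab: Dict[str, int]) -> List[int]:
        if "M" not in vocab:
            seq = seq.replace("M", "C")                # collapsing vocab: methylation info is discarded
        return [vocab.get(base, vocab["N"]) for base in seq]

    promoter = "ACGMGA"                                # same letters as "ACGCGA", but base 4 is methylated
    print(encode(promoter, CANONICAL))                 # [0, 1, 2, 1, 2, 0] -> methylation lost
    print(encode(promoter, EXTENDED))                  # [0, 1, 2, 5, 2, 0] -> methylation preserved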
There are so many open questions in biomedical ML.
The openness-impact ratio is arguably as high in biomedicine as anywhere else: if you help answer some of these questions, you could save lives.
Hopefully, awesome frameworks like this lower barriers and attract more people.
- Thank you both for an illuminating thread. Comments were concise, curious, and dense with information. Most notably, there was respectful disagreement and a levelheaded exchange of perspective.
- To provide more color on cancers caused by viruses: the World Health Organization (WHO) estimates that 9.9% of all cancers are attributable to viruses [1].
Cancers with established viral etiology or strong association with viruses include:
- Cervical cancer
- Burkitt lymphoma
- Hodgkin lymphoma
- Gastric carcinoma
- Kaposi’s sarcoma
- Nasopharyngeal carcinoma (NPC)
- NK/T-cell lymphomas
- Head and neck squamous cell carcinoma (HNSCC)
- Hepatocellular carcinoma (HCC)
- It's unclear why this drew downvotes, but to reiterate: the comment merely highlights historical facts about the CUDA moat and deliberately refrains from asserting anything about NVDA's long-term prospects or claiming that the moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.
- Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
- Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
- Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
- To address the downvotes, this comment isn't guaranteeing OAI's success. It merely notes the remarkably elevated probability of OAI escaping Nadella's grip, which was nearly unfathomable 12 months ago.
Even after breaking free, OAI must still contend with intense competition at multiple layers, including UI, application, infrastructure, and research. Moreover, it may need to battle skilled and powerful incumbents in the enterprise space to sustain revenue growth.
While the outcome remains highly uncertain, the progress since the board fiasco last year is incredible.
- For instance, it is not uncommon for cancer studies to design assays around non-oncogenic strains, or to use primer sequences whose binding sites mismatch a large number of NCBI GenBank genomes.
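A deliberately naive sketch of that failure mode (primer, strain names, and sequences are made up; real assay design relies on alignment and thermodynamics tools such as BLAST or Primer3, not exact string matching):

    # Flag genomes where a primer has no exact binding site on either strand.
    from typing import Dict

    def reverse_complement(seq: str) -> str:
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def primer_hits(primer: str, genomes: Dict[str, str]) -> Dict[str, bool]:
        """True if the primer (or its reverse complement) occurs exactly in the genome."""
        rc = reverse_complement(primer)
        return {name: (primer in g or rc in g) for name, g in genomes.items()}

    # Hypothetical mini "GenBank" of three strain sequences.
    genomes = {
        "strain_A": "TTGACCTAGGCATCGATCGGAATTC",
        "strain_B": "TTGACCTAGACATCGATCGGAATTC",   # one mismatch under the primer -> no exact hit
        "strain_C": "GAATTCCGATCGATGCCTAGGTCAA",   # reverse complement of strain_A -> hit on the other strand
    }

    print(primer_hits("CCTAGGCATCG", genomes))
    # {'strain_A': True, 'strain_B': False, 'strain_C': True}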
Another example: studies that rely on The Cancer Genome Atlas (TCGA), a rich database for cancer investigations. TCGA made a deliberate tradeoff to standardize quantification of eukaryotic coding transcripts, at the cost of excluding non-poly(A) transcripts such as EBER1/2 and other viral non-coding RNAs, thus potentially understating viral presence.
Enjoy the rabbit hole. :)