Investing the effort to 1) recognize, without programmer intervention, that some random website has an API and then 2) automatically, without further programmer intervention, retrieve the website data from that API and make intelligent use of it, is just not worth it to them when retrieving the HTML just works every time.
edit: corrected inverted ratio
> The reality is that the ratio of "total websites" to "websites with an API" is likely on the order of 1M:1 (a guess).
This is entirely wrong. Aside from the vast number of WordPress sites, the other APIs the article mentions are things like ActivityPub, oEmbed, and sitemaps. Add on things like Atom, RSS, JSON Feed, etc. and the majority of sites have some kind of alternative to HTML that is easier for crawlers to deal with. It’s nothing like 1M:1.
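To make the discovery point concrete, here's a minimal Python sketch (mine, not from the thread; assumes the requests and beautifulsoup4 packages) that checks a page's <head> for advertised feeds and the WordPress API link:

    import requests
    from bs4 import BeautifulSoup

    # MIME types that advertise a machine-friendly alternative to the HTML
    FEED_TYPES = {
        "application/rss+xml",
        "application/atom+xml",
        "application/feed+json",
    }

    def discover_alternatives(url):
        """Return feed/API URLs advertised in the page's <link> tags."""
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        found = []
        for link in soup.find_all("link", href=True):
            rel = link.get("rel") or []
            if link.get("type") in FEED_TYPES or "https://api.w.org/" in rel:
                found.append(link["href"])
        return found

One HTTP request per site tells you whether something cheaper than HTML scraping is on offer.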
> Investing the effort to 1) recognize, without programmer intervention, that some random website has an API and then 2) automatically, without further programmer intervention, retrieve the website data from that API and make intelligent use of it, is just not worth it to them when retrieving the HTML just works every time.
You are treating this like it’s some kind of open-ended exercise where you have to write code to figure out APIs on the fly. This is not the case. This is just “Hey, is there a <link rel="https://api.w.org/"> in the page? Pull from the WordPress API instead”. That gets you better-quality content, more efficiently, for >40% of all sites just by implementing one API.
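As a sketch of the “pull from the WordPress API instead” step (the /wp-json/wp/v2/posts route is WordPress core; the field handling is illustrative, and assumes the requests package):

    import requests

    def fetch_wordpress_posts(site, per_page=10):
        """Fetch posts as structured JSON from the core WordPress REST API."""
        resp = requests.get(
            site.rstrip("/") + "/wp-json/wp/v2/posts",
            params={"per_page": per_page},
            timeout=10,
        )
        resp.raise_for_status()
        # Title, date, and rendered content arrive as separate fields --
        # no DOM scraping, no guessing where the article body lives.
        return [
            {
                "title": p["title"]["rendered"],
                "date": p["date"],
                "content": p["content"]["rendered"],
            }
            for p in resp.json()
        ]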
Hrm…
>> Like most WordPress blogs, my site has an API.
I think WordPress is big enough to warrant the effort. The fact that AI companies are destroying the web isn't news. But they could certainly do it with a little less jackass. I support this take.
For example, Reddit encouraged those tools to use the API, then once it gained traction, they began charging exorbitant fees, effectively blocking every such tool.
You know how you sometimes have to call a big company's customer support and try to convince some rep in India to press the right buttons on their screen to fix your issue, because they have a special UI you don't get to use? Imagine that, but it's an AI, and everything works that way.
"be conservative in what you send, be liberal in what you accept"
https://en.wikipedia.org/wiki/Robustness_principle

The more effective way to think about it is that "the ambiguity" silently gets blended into the data. It might disappear from superficial inspection, but it's not gone.
The LLM is essentially just doing educated guesswork without leaving a consistent or thorough audit trail. This is a fairly novel capability and there are times where this can be sufficient, so I don't mean to understate it.
But it's a different thing than making ambiguity "disappear" when it comes to systems that actually need true accuracy, specificity, and non-ambiguity.
Where it matters, there's no substitute for "very explicit structured data" and never really can be.
please do not write code. ever. Thinking like this is why people now think that 16GB of RAM is too little and 4 cores is the minimum.
API -> ~200,000 cycles to get data, RAM O(size of data), precise result
HTML -> LLM -> ~30,000,000,000 cycles to get data, RAM O(size of LLM weights), results partially random and unpredictable
What do you think the E in perl stands for?
Multiply that by every site, and that approach does not scale. Parsing HTML scales.
In contrast, I can always trust that whatever is returned for the browser to consume is in a format a browser can consume, because if it isn't, the site isn't a working website. HTML is pretty much the only format guaranteed to work.
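For what it's worth, deterministic HTML extraction is a few lines (a sketch assuming beautifulsoup4; the tag choices are mine):

    from bs4 import BeautifulSoup

    def extract_text(html):
        """Cheap, deterministic text extraction from arbitrary HTML."""
        soup = BeautifulSoup(html, "html.parser")
        # Drop non-content tags, then flatten what's left; the parser
        # tolerates broken markup much the way a browser does.
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        return " ".join(soup.get_text(" ").split())

Same input, same output, every time.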
using an llm to parse html -> please do not
You're absolutely welcome to waste your own free time on whatever feels right.
> using an llm to parse html -> please do not
have you used any of these tools with a beginner's mindset in like, five years?
EDIT: I hemmed and hawed about responding to your attitude directly, but do you talk to people anywhere but here? Is this the attitude you would bring to normal people in your life?
Dick Van Dyke is 100 years old today. Do you think the embittered and embarrassing way you talk to strangers on the internet is positioning your health to enable you to live that long, or do you think the positive energy he brings to life has an effect? Will you readily die to support your animosity?
> Like most WordPress blogs, my site has an API.
WordPress, for all its faults, powers a fair number of websites. The schema is identical across all of them.
The API may be equivalent, but it is still conceptually secondary. If it went stale, readers would still see the site, and it makes sense for a scraper to follow what readers can see (or alternatively to consume both, and mine both).
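A sketch of the "consume both" option (my framing, not the commenter's code; assumes requests and beautifulsoup4): treat the HTML as primary, and mine the API as a secondary source whenever the page advertises one:

    import requests
    from bs4 import BeautifulSoup

    def scrape(url):
        """Fetch what readers see; mine the API too when it's advertised."""
        html = requests.get(url, timeout=10).text
        api_data = None
        link = BeautifulSoup(html, "html.parser").find(
            "link", rel="https://api.w.org/", href=True
        )
        if link:
            # Secondary source: structured, but possibly stale --
            # cross-check against the HTML rather than trusting it alone.
            api_data = requests.get(link["href"], timeout=10).json()
        return {"html": html, "api": api_data}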
The author might be right to be annoyed with the scrapers for many other reasons, but I don't think this is one of them.