
I’ve been a web developer for decades, as well as doing scraping, indexing, and analysis of millions of sites.

Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.

This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.

As a site owner, that pisses me off as a matter of decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.
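The "golden rule" above boils down to per-host rate limiting. A minimal sketch of what that looks like, with an illustrative 10-second delay and an injectable clock (both assumptions, not any real crawler's API):

```python
import time

class PoliteScheduler:
    """Track, per host, the earliest time it's polite to send the next request."""

    def __init__(self, min_delay: float = 10.0, clock=time.monotonic):
        self.min_delay = min_delay      # seconds between requests to one host
        self.clock = clock              # injectable for testing
        self.next_ok: dict[str, float] = {}

    def wait_time(self, host: str) -> float:
        """Seconds the caller should sleep before hitting `host` again."""
        now = self.clock()
        wait = max(0.0, self.next_ok.get(host, now) - now)
        # Reserve the slot: the request after this one must wait min_delay more.
        self.next_ok[host] = now + wait + self.min_delay
        return wait
```

A real crawler would also honor robots.txt `Crawl-delay` and back off on 429/503 responses; this only shows the core bookkeeping.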


I worked at one place where it probably cost us 100x more CPU to serve content the way we were doing it than the way most people would. We could afford it because it was still cheap, but we deferred the cost-reduction work for half a decade and waged war on webcrawlers instead. (Hint: who introduced the robots.txt standard?)
These people think they're on the verge of the most important invention in modern history. Etiquette means nothing to them. They would probably consider an impediment to their work a harm to the human race.
>They would probably consider an impediment to their work a harm to the human race.

They do. Marc Andreessen said as much in his "techno-optimist manifesto": that any hesitation or slowdown in AI development or adoption is equivalent to mass murder.

I want to believe he's bullshitting to hype it up for profit, because at least that's not as bad as if it were sincere.
It's not just for profit, it's to save some future, mythical version of humankind! https://netzpolitik.org/2023/longtermism-an-odd-and-peculiar...
I also want to believe AI hype is being fueled just by grifters and not an accelerationist messiah cult (and also grifters). But they do seem really committed to the bit.
Yeah but it’s just shit engineering. They re-crawl entire sites basically continuously absent any updates or changes. How hard is it to cache a fucking sitemap for a week?

It’s a waste of bandwidth and CPU on their end as well, “the bitter lesson” isn’t “keep duplicating the same training data”.

I’m glad DeepSeek is showing how inefficient and dogshit most frontier-model engineering is, and how much VC is getting burned literally re-downloading a copy of the entire web daily when less than 1% of it is new data.

I get they have no shame economically, that they are deluded and greedy. But bad engineering is another class of sin!
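"Cache a sitemap for a week" is concrete: instead of re-crawling everything, check each URL's `<lastmod>` against the last crawl and fetch only what changed. A sketch, assuming the site actually publishes `<lastmod>` (the example sitemap and function name are illustrative):

```python
import xml.etree.ElementTree as ET
from datetime import date

# Standard sitemap namespace, per sitemaps.org.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_changed_since(sitemap_xml: str, last_crawl: date) -> list[str]:
    """Return only the URLs whose <lastmod> is newer than the last crawl."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # No <lastmod>? Be conservative and re-fetch that one URL.
        if lastmod is None or date.fromisoformat(lastmod[:10]) > last_crawl:
            changed.append(loc)
    return changed

SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2023-01-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2024-06-01</lastmod></url>
</urlset>"""

print(urls_changed_since(SITEMAP, date(2024, 1, 1)))
# -> ['https://example.com/new']
```

For pages without a trustworthy `<lastmod>`, conditional HTTP requests (`If-Modified-Since` / `If-None-Match`) get the same result: a 304 costs the server almost nothing compared to re-serving the full page.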

We've had to block a lot of these bots as they slowed our technical forum to a crawl, but new ones appear every now and again. Amazon's was the worst.
I really wonder if these dogshit scrapers are wholly built by LLMs. Nobody competent codes like this.
