Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know who is running large internet infrastructure has a similar story -- this post does a great job of rounding a bunch of them up in one place.

I called it when I wrote it: they are just burning their goodwill to the ground.

I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, and the link in their User-Agent led to a 404. An engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.


> just burning their goodwill to the ground

AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.

Yep. And it is much more far-reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need goodwill when they own everything?

"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.

> Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet.

And this is why AI training is not "fair use". The AI companies seek to train models in order to compete with the authors of the content used to train the models.

A possible eventual downfall of AI is that the risk of losing a copyright infringement lawsuit is not going away. If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.

I've pointed this out to a few people in this space. They tend to suggest that the value in AI is so great this means we should get rid of copyright law entirely.

That value is only great if it's shared equitably with the rest of the planet.

If it's owned by a few, as it is right now, it's an existential threat to the life, liberty, and pursuit of happiness of everyone else on the planet.

We should be seriously considering what we're going to do in response to that threat if something doesn't change soon.

That's a talking point for bros looking to exploit it as their ticket.

"The upside of my gambit is so great for the world, that I should be able to consume everyone else's resources for free. I promise to be a benevolent ruler."

"What's good for Milo Minderbinder is good for the world."
…meaning that whatever model results would have no protection, and would be free for anyone to use?
That's not how conservatism works. AI oligarchs are part of the "in" group in the "there are laws that protect but do not bind the in group, and laws that bind but do not protect the out group" summary. Anyone with a net worth less than FOTUS is part of the "out" group.
AI is worthless without training data. If all content becomes AI generated because AI outcompetes original content then there will be no data left to train on.

When Google first came out in 1998, it was amazing -- spooky how good it was. Then people figured out how to game PageRank and Google's accuracy cratered.

AI is now in a similar bubble period. Throwing out all of copyright law just for the benefit of a few oligarchs would be utter foolishness. Given who is in power right now I'm sure that prospect will find a few friends, but I think the odds of it actually happening before the bubble bursts are pretty small.

Are we not past critical mass though? The velocity at which these things can outcompete human labor is astonishing; any future human creation or piece of original content will already have lost the battle the moment it goes online and gets cloned by AI.

We should, but not for those reasons.

If software and ideas become commodities and the legal ecosystem around creating captive markets disappears, then we will all be much better off.

I'm doubtful the AI companies would be happy with getting rid of laws protecting _their_ intellectual property.

What an infantile worldview.

There's already been an interesting ruling that a pure AI output is not, in itself, copyrightable.

I'm curious what will happen when someone modifies a single byte (or a "sufficient" number of bytes) of AI output, thereby creating a derivative work, and then claims copyright on that modified work.
> The AI companies seek to train models in order to compete with the authors of the content used to train the models.

When I read someone else’s essay I may intend to write essays like that author. When I read someone else’s code I may intend to write code like that author.

AI training is no different from any other training.

> If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.

Do you mean the output of the AI training process (the model), or the output of the AI model? If the former, yes, sure: if a model actually contains within it copies of data, then it’s a copy of that work.

But we should all be very wary of any argument that the ability to create a new work which is identical to a previous work is itself derivative. A painter may be able to copy van Gogh, but neither the painter’s brain nor his non-copy paintings (even those in the style of van Gogh) are copies of van Gogh’s work.

If you as an individual recognizably regurgitate the essay you read, then you have infringed. If an AI model recognizably regurgitates the essay it trained on, then it has infringed. The AI argument that passing original content through an algorithm insulates the output from claims of infringement because of "fair use" is pigwash.
> AI training is no different from any other training.

Yes, it is. One is done by a computer program, and one is done by a human.

I believe in the rights and liberties of human beings. I have no reason to believe in rights for silicon. You, and every other AI apologist, are never able to produce anything to back up what is largely seen as an outrageous worldview.

You cannot simply jump the gun and compare AI training to human training like it's a foregone conclusion. No, it doesn't work that way. Explain why AI should have rights. Explain if AI should be considered persons. Explain what I, personally, will gain from extending rights to AI. And explain what we, collectively, will gain from it.

Outcomes matter. Things that are fine at an individual level can become socially harmful at scale.

What happens when a culture becomes overwhelmingly individualistic and becomes blind to the at-scale harms?
I have this line of thought as well, but then I wonder: if we are all out of jobs and out of substantial capital to spend, how do these owners make money ultimately? It's a genuine question and I'm probably missing something obvious. I can see a benevolent/post-scarcity spin to this but the non-benevolent one seems self-defeating.

"Making money" is only a relevant goal when you need money to persuade humans to do things for you.

Once you have an army of robot slaves ... you've rendered the whole concept of money irrelevant. Your skynet just barters rare earth metals with other skynets and your robot slaves furnish your desired lifestyle as best they can given the amount of rare earth metals your skynet can get its hands on. Or maybe a better skynet / slave army kills your skynet / slave army, but tough tits, sucks to be you and rules to be whoever's skynet killed yours.

Good thing AI for now needs power, water and a place to exchange heat. Our version of womp rats if it goes too far, I guess.

That's part of the "rare earth metals" synecdoche - hydroelectric dams, thorium mines, great lakes heat sinks - they're all things for skynets to kill or barter for as expedient.
I don’t think you’re missing anything, I think the plan really is to burn it all down and rule over the ashes. The old saw “if you’re so smart, why aren’t you rich?” works in reverse too. This is a foolish, shortsighted thing to do, and they’re doing it anyway. Not really thinking about where value actually comes from or what their grandchildren’s lives would be like in such a world.

Capitalism is an unthinking, unfeeling force. The writing is on the wall that AI is coming, and being altruistic about it doesn’t do jack to keep others from the land grab. Their thinking is, might as well join the rush and hope they’re one of the winners. Every one of us sitting on the sidelines will be impacted in some way or the other. So who’re the smart ones, the ones who grab shovels and start digging, or the ones who watch as the others dig their graves and do nothing?

I think the obvious thing you are missing is just b2b. It doesn’t actually matter if people have any money.

Similar to how advertising and legal services are required for everything but have ambiguous ROI at best, AI is set to become a major “cost of doing business” tax everywhere. Large corporations welcome this even if it’s useless, because it drags down smaller competitors and digs a deeper moat.

Executives large and small mostly have one thing in common though: they have nothing but contempt for both their customers and their employees, and would much rather play the mergers-and-acquisitions type of games than do any real work in their industry (which is how we end up in a world where the doors are flying off airplanes mid-flight). Either they consolidate power by getting bigger or they get a cushy exit, so who cares about any other kind of collateral damage?

Money is a proxy for control. Eventually humans will become mostly redundant and slated for elimination except for the chosenites of the managerial classes and a small number of technicians. Either through biological agents, famines, carefully engineered (civil?) wars and conflicts designed to only exterminate the non-managerial classes, or engineered Calhounian behavioral sinks to tank fertility rates below replacement.

Ssssh, you can't say that. Those types of brain damage are protected diversity.

Why should we care if they make money? Owning things isn't a contribution to society.

Building things IS a contribution to society, but the people who build things typically aren't the ultimate owners. And even in cases where the builders and owners are the same, entitling the builders and all of their future heirs to rent seek for the rest of eternity is an inordinate reward.

You don't. It's like Minecraft. You can do almost everything in Minecraft alone and everything exists in infinite quantity, so why trade in the first place?

This goes both ways. Let's say there is something you want but you're having trouble obtaining it. You'd need to give something in exchange.

But the seller of what you want doesn't need the things you can easily acquire, because they can get those things just as easily themselves.

The economy collapses back into self sufficiency. That's why most Minecraft economy servers start stagnating and die.

Unfortunately I don’t think the logic extends beyond “if we don’t do it, someone else will”. Anything after that is secondary.

What people say is not the same as what people do... in other words, what is spoken in public repeatedly is not representative of actual decision flows.

Money is only a bookkeeping tool for complex societies. The aim of the owner class in a worker-less world would be accumulation of important resources to improve their lives and to trade with other owners (money would likely still be used for bookkeeping here). A wealthy resource-owner might strive to maintain a large zone of land, defended by AI weaponry, that contains various industrial/agricultural facilities producing goods and services via AI.

They would use some of the goods/services produced themselves, and also trade with other owners to live happy lives with everything they need, no workers involved.

The owners may let the jobless working class inhabit unwanted land, until they change their minds.

Better for them to give us jobs so we owe them and are less likely to revolt!

One (satirical) answer to this question is given in Greg Egan's "The Discrete Charm of the Turing Machine" (2017). https://i.4pcdn.org/tg/1599529933107.pdf

The non-benevolent future is not self-defeating; we have historical examples of depressingly stable economies with highly concentrated ownership. The entirety of the European dark ages was the end result of (western[0]) Rome's elites tearing the planks out of the hull of the ship they were sailing. The consequence of such a system is economic stagnation, but that's not a consequence that the elites have to deal with. After all, they're going to be living in the lap of luxury -- who cares if the economy stagnates?

This economic relationship can be collectively[1] described as "feudalism". This is a system in which:

- The vast majority of people are obligated to perform menial labor, i.e. peasant farmers.

- Class mobility is forbidden by law and ownership predominantly stays within families.

- The vast majority of wealth in the economy is in the form of rents paid to owners.

We often use the word "capitalist" to describe all businesses, but that's a modern simplification. Businesses can absolutely engage in feudalist economies just as well, or better, than they can engage in capitalist ones. The key difference is that, under capitalism, businesses have to provide goods or services that people are willing to pay for. Feudalism makes no such demand; your business is just renting out a thing you own.

Assuming AI does what it says on the tin (which isn't at all obvious), the endgame of AI automation is an economy of roughly fifty elite oligarchs who own the software to make the robots that do all work. They will be in a constant state of cold war, having to pay their competitors for access to the work they need done, with periodic wars (kinetic, cyber, legal, whatever) being fought whenever a company intrudes upon another's labor-enclave.

The question of "well, who pays for the robots" misunderstands what money is ultimately for. Money is a token that tracks tax payments for coercive states. It is minted specifically to fund wars of conquest; you pay your soldiers in tax tokens so the people they conquer will have to barter for money to pay the tax collector with[2]. But this logic assumes your soldiers are engaging in a voluntary exchange. If your 'soldiers' are killer robots that won't say no and only demand payment in energy and ammunition, then you don't need money. You just need to seize critical energy and mineral reserves that can be harvested to make more robots.

So far, AI companies have been talking of first-order effects like mass unemployment and hand-waving about UBI to fix it. On a surface level, UBI sounds a lot like the law necessary to make all this AI nonsense palatable. Sam Altman even paid to have a study done on UBI, and the results were... not great. Everyone who got money saw real declines in their net worth. Capital-c Conservative types will get a big stiffy from the finding that UBI did lead people to work less, but that's only part of the story. UBI as promoted by AI companies is bribing the peasants. In the world where the AI companies win, what is the economic or political restraining bolt stopping the AI companies from just dialing the UBI back and keeping more of the resources for themselves once traditional employment is scaled back? Like, at that point, they already own all the resources and the means of production. What makes them share?

[0] Depending on your definition of institutional continuity - i.e. whether or not Istanbul is still Constantinople - you could argue the Roman Empire survived until WWI.

[1] Inasmuch as the complicated and idiosyncratic economic relationships of medieval Europe could even be summed up in one word.

[2] Ransomware vendors accidentally did this, establishing Bitcoin (and a few other cryptos) as money by demanding it as payment for a data ransom.

You may find the book "Technofeudalism: What Killed Capitalism" to your liking.

And how could they possibly base their actions on good when their technology is more important than fire? History is depending on them to do everything possible to increase their market cap.

Careful, I think you're being sarcastic, but you're in a space where a lot of people believe what you just said unironically.

Ha! Comment of the week.

More important than fire? AI runs on fire.
> The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained.

I agree with you in the case of AI companies, but the desire to own everything and be completely unconstrained is the dream of every large corporation.

> remake the entire world into one where the owners of these companies own everything and are completely unconstrained

how has this been any different from the past 10,000 years of human conquest and domination?

in the past, you had to give some of your spoils to those who did the conquering for you, and laborers after that. if you can automate and replace all work, including maintaining the robots that do that and training them, you no longer need to share anything.

In my view it's the same thing, same trajectory -- with more power in the hands of fewer people further along the trajectory.

It can be better or worse depending on what those with power choose to do. Probably worse. There has been conquest and domination for a long time, but ordinary people have also lived in relative peace gathering and growing food in large parts of the world in the past, some for entire generations. But now the world is rapidly becoming unable to support much of that as abundance and carrying capacity are deleted through human activity. And eventually the robot armies controlled by a few people will probably extract and hoard everything that's left. Hopefully in some corners some people and animals can survive, probably by being seen as useful to the owners.

On the bright side, armies of robot slaves give us an off-ramp from the unsustainable pyramid scheme of population growth.

Be fruitful, and multiply, so that you may enjoy a comfortable middle age and senescence exploiting the shit out of numerous naive 25-year-olds! If it's robots, we can ramp down the population of both humans and robots until the planet can once again easily provide abundance.

The thing is that this will be their destruction as well. If workers don't have any money (because they don't have jobs), nobody can afford what the owners have to sell.

The human population will be decimated just as the work horse population was.

They are also gutting the profession of software engineering. It's a clever scam actually: to develop software a company will need to pay utility fees to A"I" companies, and since their products are error prone, voila, use more A"I" tools to correct the errors of the other tools. Meanwhile software knowledge will atrophy and soon, a la WALL-E, we'll have software "developers" with 'soft bones' floating around on conveyed seats slurping 'sugar water' and getting fat and not knowing even how to tie their software shoelaces.

Yes, like the Pixel camera app, which mangles photos with AI processing, and users complain that it won't let people take pics.

One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.

Which is what AI is, really.

AI tarpits && lim (human curated content / mediocre AI answers -> 0) = AIs crumbling into dust by themselves.

We, the people, might need to come up with a few proverbial tranquilizer guns here soon.
Maxim 1: "Pillage, then burn."

Another Schlock Mercenary fan? Or does this adage have many adherents?

The adage predates the longest continuous webcomic, but not as a maxim.

Yep, a fan I am.

That's pretty much what our future would look like -- you are irrelevant. Well, I mean, we are already pretty much irrelevant nowadays, but the more so in the "progressive" future of AI.

Rules and laws are for other people. A lot of people reading this comment who have mistaken "fake it til you make it" or "better to not ask permission" for good life advice are responsible for perpetuating these attitudes, which are fundamentally narcissistic.
"... you have the lawyers clean it all up later." - Eric Schmidt
> AI will be incorporated into all products whether you like it or not

AI will be incorporated into the government, whether you like it or not.

FTFY!

I think the logic is more like “we have to do everything we can to win or we will disappear”. Capitalism is ruthless and the big techs finally have some serious competition, namely: each other as well as new entrants.

Like why else can we just spam these AI endpoints and pay $0.07 at the end of the month? There is some incredible competition going on. And so far everyone except big tech is the winner so that’s nice.

> One crawler downloaded 73 TB of zipped HTML files in May 2024 [...] This cost us over $5,000 in bandwidth charges

I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.
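
To put numbers on that, a quick back-of-the-envelope comparison using the figures quoted above ($5,000 for 73 TB) against my dedicated-server rates:

    # Effective egress rate implied by the quoted incident, versus
    # typical dedicated-server transfer costs ($0.50-$3.00 per TB).
    aws_bill_usd = 5_000    # bandwidth charges from the quoted post
    traffic_tb = 73         # zipped HTML pulled by one crawler in a month

    cloud_rate = aws_bill_usd / traffic_tb   # ~$68.5 per TB
    for dedicated_rate in (0.50, 3.00):
        markup = cloud_rate / dedicated_rate
        print(f"${cloud_rate:.2f}/TB vs ${dedicated_rate:.2f}/TB "
              f"-> roughly {markup:.0f}x markup")

That works out to roughly a 23x to 137x markup, depending on which end of the dedicated-server range you compare against.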

I think the uncontrolled price of cloud traffic is a real fraud, and a way bigger problem than some AI companies ignoring robots.txt. One time we went over a limit on Netlify or something, and they charged over a thousand dollars for a couple TB.
> I think the uncontrolled price of cloud traffic is a real fraud

Yes, it is.

> and a way bigger problem than some AI companies ignoring robots.txt.

No, it absolutely is not. I think you underestimate just how hard these AI companies hammer services - it is bringing down systems that have weathered significant past traffic spikes with no issues, and the traffic volumes are at the level where literally any other kind of company would've been banned by their upstream for "carrying out DDoS attacks" months ago.

> I think you underestimate just how hard these AI companies hammer services

Yeah, I completely don't understand this, and I don't understand comparing it with DDoS attacks. There's no difference from what search engines are doing, and in some way it's worse? How? It's simply scraping data -- what significant problems can it cause? Cache pollution? And that's it? I mean, even when we're talking about ignoring robots.txt (which search engines often do too) and calling costly endpoints -- what is the problem with adding some captcha or rate limiters to those endpoints?

> which I then emailed 3x and never got a reply.

Send a bill to their accounts payable team instead.

Detect AI scraper and inject an in-page notice that by continuing they accept your terms of use.

The terms of use charge them per page load, couched in some terminology about abuse.

Profit... By sending them invoices :-)
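
A minimal sketch of the detect-and-inject half, assuming a Flask app and a hand-maintained list of crawler User-Agent tokens (whether such terms would hold up is a question for a lawyer, not middleware):

    # Append a terms-of-use notice to HTML served to suspected AI crawlers,
    # identified here by User-Agent substrings (the list is illustrative).
    from flask import Flask, request

    app = Flask(__name__)

    AI_BOT_MARKERS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

    NOTICE = (b'<p class="bot-terms">Automated access notice: by continuing '
              b'to crawl this site you agree to our terms of use, including '
              b'a per-page fee for bulk retrieval.</p>')

    @app.route("/")
    def index():
        return "<html><body>Hello</body></html>"

    @app.after_request
    def inject_terms(resp):
        ua = request.headers.get("User-Agent", "")
        if any(m in ua for m in AI_BOT_MARKERS) and \
                (resp.content_type or "").startswith("text/html"):
            resp.set_data(resp.get_data() + NOTICE)
        return resp

Generating the invoices is left as an exercise.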

Honestly this is crazy enough to work. Bonus points if both you and the scraping company reside in the same state.

>> which I then emailed 3x and never got a reply.

At which point does the crawling cease to be a bug/oversight and constitute a DDoS?

Maybe just feed them dynamically generated garbage information? More fun than no information.

OP’s linked blog post mentioned they got hit with a large spike in bandwidth charges. Sending them garbage information costs money.

Yeah, you have a point. Hmm, I wish there were a way to somehow generate that garbage with minimal bandwidth. Something like: I can send you a very compressed 256 bytes of data which expands to something like 1 megabyte.
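
(Rough sketch of the idea, assuming the crawler honors Content-Encoding: gzip. Deflate tops out around 1030:1, so 256 bytes won't quite reach a megabyte, but a few KB expanding to ~10 MB is easy:)

    # Sketch: a small gzip payload that expands enormously on the client.
    # Deflate's ratio ceiling is roughly 1030:1, so ~10 KB -> ~10 MB.
    import gzip

    expanded = b"\0" * (10 * 1024 * 1024)            # 10 MB of zeros
    payload = gzip.compress(expanded, compresslevel=9)
    print(f"{len(payload):,} bytes on the wire -> {len(expanded):,} bytes expanded")

    # Serve `payload` with response headers:
    #   Content-Encoding: gzip
    #   Content-Type: text/html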
There is -- but instead of garbage-expanding data, add several delays within the response so that the data takes extraordinarily long to arrive.

Depending on the number of simultaneous requesting connections, you may be able to do this without a significant change to your infrastructure. There are ways to do it that don't exhaust your number of (IP, port) available too, if that is an issue.

Then the hard part is deciding which connections to slow, but you can start with a proportional delay based on the number of bytes per source IP block or do it based on certain user agents. Might turn into a small arms race but it's a start.

Tarpit instead? Trickle out a dead end response (no links) at bytes-per-second speeds until the bot times out.

https://en.wikipedia.org/wiki/Tarpit_(networking)
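
A minimal tarpit sketch along those lines, assuming Flask and a hypothetical is_suspected_bot() heuristic (real detection -- user agents, per-IP request rates -- is its own arms race):

    # Trickle a link-free page to suspected bots a few bytes per second,
    # tying up each bot connection for minutes at near-zero bandwidth cost.
    import time
    from flask import Flask, Response, request

    app = Flask(__name__)

    def is_suspected_bot(req) -> bool:
        # Placeholder heuristic; substitute real UA / rate / IP-block checks.
        return "bot" in req.headers.get("User-Agent", "").lower()

    def trickle(body: bytes, bytes_per_sec: int = 8):
        for i in range(0, len(body), bytes_per_sec):
            yield body[i:i + bytes_per_sec]
            time.sleep(1.0)

    @app.route("/<path:page>")
    def serve(page):
        if is_suspected_bot(request):
            # Dead end: no links in the body, so nothing further to crawl.
            return Response(trickle(b"<html><body>One moment...</body></html>"),
                            mimetype="text/html")
        return f"normal page for {page}"  # stand-in for the real handler

One caveat: with a synchronous server, each tarpitted connection holds a worker, so you'd want an async server, or to do this at the reverse proxy.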

It does not even have to be dynamically generated. Just pre-generate a few thousand static pages of AI slop and serve that. Probably cheaper than dynamic generation.

I kind of suspect some of these companies probably have more horsepower and bandwidth in one crawler than a lot of these projects have in their entire infrastructure.
Thanks for writing about this. Is it clear that this is from crawlers, as opposed to dynamic requests triggered by LLM tools, like Claude Code fetching docs on the fly?

Along with having block lists, perhaps you could add poison to your results: randomly generated bad code that will not work, visible only to bots (display: none when rendered). The bots will use it, but a human never would.
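
A sketch of what that could look like, with a hypothetical poison_block() helper (crawlers that actually render CSS will filter it out, so this only catches the naive ones):

    # Hypothetical helper: junk "code" in a container that humans never see
    # (display:none) but that a CSS-ignoring scraper ingests as content.
    import random
    import string

    def poison_block(n_lines: int = 5) -> str:
        junk = "\n".join(
            "".join(random.choices(string.ascii_letters + "_ =+()", k=60))
            for _ in range(n_lines)
        )
        return ('<div style="display:none" aria-hidden="true">'
                f"<pre><code>{junk}</code></pre></div>")

    # Append poison_block() to pages served to suspected crawlers.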
Wondering if you tried stopping such bots with a captcha?
