If I stock a Little Free Library at the end of my driveway, it's because I want people in the community to peruse and swap the books in a way that's intuitive to pretty much everyone who might encounter it.
I shouldn't need to post a sign outside of it saying "Please don't just take all of these at once", and it'd be completely reasonable for me to feel frustrated if someone did misuse it -- regardless of whether the sign was posted.
Just because something is technically possible and not illegal does NOT make it the right thing to do.
Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know the actual numbers and rates because they weren't reported, and that omission leads me to assume they're just trying to get some publicity.
They published the site publicly for their customers to browse, with the side benefit that curious people could also use it in moderation, since that didn't affect the business in any real way. OpenAI isn't their customer, and OpenAI's use is affecting the business through hosting costs and lost revenue from downtime.
The obvious next step is to gate that data behind a login, and now we (the entire world) all have slightly less information at our fingertips because OpenAI did what they do.
The point is that OpenAI, or anyone running massive scraping operations, should know better by now. Sure, the small company that doesn't do web design had a single file misconfigured, but that shouldn't be a four- or five-figure mistake. OpenAI knows what bandwidth costs. There should be a mechanism that says: hey, we've requested many gigabytes or terabytes of data from a single domain in one scrape, that's a problem.
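Even something as crude as a per-domain budget would go a long way. A rough sketch of the idea in Python (the specific limits, names, and thresholds here are invented for illustration, not anything OpenAI is known to use):

    import time
    from collections import defaultdict

    # Illustrative per-domain crawl budget; the limits are made-up examples.
    MAX_BYTES_PER_DOMAIN = 2 * 1024**3   # stop and flag after ~2 GB from one domain
    MIN_SECONDS_BETWEEN_REQUESTS = 1.0   # at most ~1 request per second per domain

    bytes_fetched = defaultdict(int)
    last_request_at = defaultdict(float)

    def allow_request(domain: str) -> bool:
        """Return True only if this domain is still within its crawl budget."""
        if bytes_fetched[domain] >= MAX_BYTES_PER_DOMAIN:
            return False  # budget exhausted: escalate to a human instead of continuing
        wait = MIN_SECONDS_BETWEEN_REQUESTS - (time.monotonic() - last_request_at[domain])
        if wait > 0:
            time.sleep(wait)  # throttle rather than hammer the origin
        last_request_at[domain] = time.monotonic()
        return True

    def record_response(domain: str, size_in_bytes: int) -> None:
        bytes_fetched[domain] += size_in_bytes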
> The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.
These two statements are at odds, I hope you realize: you say public accessibility of information is a good thing, while blaming someone for being effectively DDoSed as a result of making that information public.
The clickbaity hysteria here misses that this sort of scraping was possible long before AI agents showed up a couple of years ago.
It's the first sentence of the article.
> On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down.
If a scraper is making enough requests to take someone else's website down, the scraper's requests are being made unreasonably quickly.
There's no chance every single website in existence is going to have a flawless setup. That's guaranteed simply by the number of websites and how old some of them are.
Learning how is sometimes really learning who is going to get you online in a good way.
In this case, when you have non-technical people building WordPress sites, it comes down to what they can understand and do, and the rate of learning doesn't always keep up with client work.
It's one thing if a company ignores robots.txt and causes serious interference with a service, as Perplexity was doing, but the details here don't really add up: this company didn't have a robots.txt in place, and although the article mentions tens or hundreds of thousands of requests, it doesn't say anything about them being made unreasonably quickly.
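For reference, this is roughly the check a well-behaved crawler runs before fetching anything, sketched with Python's stdlib urllib.robotparser. The robots.txt contents and URLs are illustrative; GPTBot is the user agent OpenAI documents for its crawler.

    import urllib.robotparser

    # Example robots.txt a site could serve; values other than the GPTBot
    # user agent are made up for this sketch.
    robots_lines = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Crawl-delay: 10",
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)  # against a live site: rp.set_url(".../robots.txt"); rp.read()

    print(rp.can_fetch("GPTBot", "https://example.com/browse/"))  # False
    print(rp.crawl_delay("SomeOtherBot"))                         # 10

Without a robots.txt to parse, a crawler has nothing to honor, which is why the absence of one here matters.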
The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.
EDIT: They're a very media-heavy website. Here's one of the product pages from their catalog: https://triplegangers.com/browse/scans/full-body/sara-liang-.... Each of the body-pose images is displayed at about 35x70px but is served as a 500x1000px image. It now seems like they have at least some Cloudflare caching in place.
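Pre-generating listing-size thumbnails is a cheap fix for that kind of bandwidth waste. A rough sketch with Pillow; the paths and sizes are hypothetical, not taken from their actual setup:

    from pathlib import Path
    from PIL import Image  # Pillow

    # Pre-generate small thumbnails for listing pages so a ~35x70px slot isn't
    # filled by a 500x1000px original on every page view. Paths are made up.
    source_dir = Path("media/poses")
    thumb_dir = Path("media/pose-thumbs")
    thumb_dir.mkdir(parents=True, exist_ok=True)

    for src in source_dir.glob("*.jpg"):
        with Image.open(src) as img:
            img.thumbnail((70, 140))  # ~2x the displayed size for high-DPI screens
            img.save(thumb_dir / src.name, quality=80)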
I stand by my belief that unless we get some evidence that they were being scraped particularly aggressively, this is on them, and this is being blown out of proportion for publicity.