I have little sympathy for the company in this article. If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content.

It's one thing if a company ignores robots.txt and causes serious interference with the service, as Perplexity was doing, but the details here don't really add up: this company didn't have a robots.txt in place, and although the article mentions tens/hundreds of thousands of requests, they don't say anything about them being made unreasonably quickly.

The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.

EDIT: They're a very media-heavy website. Here's one of the product pages from their catalog: https://triplegangers.com/browse/scans/full-body/sara-liang-.... Each of the body-pose images is displayed at about 35x70px but is served as a 500x1000px image. It now seems like they have some Cloudflare caching in place, at least.
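For what it's worth, the fix that observation implies is straightforward: pre-generate thumbnails near the size actually displayed instead of serving the 500x1000px originals. A rough sketch with Pillow (the filenames are made up for illustration):

    from PIL import Image

    # Downscale a catalog image to roughly the displayed size
    # (~35x70px on the page; doubled here for high-DPI screens).
    img = Image.open("sara-liang-pose-01.jpg")   # hypothetical filename
    img.thumbnail((70, 140))   # fits within 70x140, preserves aspect ratio
    img.save("sara-liang-pose-01.thumb.jpg", quality=80)

Going from 500x1000 to 70x140 is roughly a 50x reduction in pixels per image, before any format or quality tweaks.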

I stand by my belief that unless we get some evidence that they were being scraped particularly aggressively, this is on them, and this is being blown out of proportion for publicity.


> I have little sympathy for the company in this article. If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content.

If I stock a Little Free Library at the end of my driveway, it's because I want people in the community to peruse and swap the books in a way that's intuitive to pretty much everyone who might encounter it.

I shouldn't need to post a sign outside of it saying "Please don't just take all of these at once", and it'd be completely reasonable for me to feel frustrated if someone did misuse it -- regardless of whether the sign was posted or not.

There is nothing inherently illegal about filling a small store to occupancy capacity with all of your friends and never buying anything.

Just because something is technically possible and not illegal does NOT make it the right thing to do.

As the saying goes, "it's not illegal" is a very low bar for morality.

From the Wayback Machine [0] it seems they had a normal "open" set-up. They wanted to be indexed, but it's probably a fair concern that OpenAI isn't going to respect their image license. The article describes the robot.txt [sic] as now "properly configured", but their solution was to block everything except Google, Bing, Yahoo, and DuckDuckGo. That seems to be the smart thing these days, but it's a shame for any new search engines.

[0] https://web.archive.org/web/20221206134212/https://www.tripl...
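For concreteness, an allow-list robots.txt along those lines might look something like this (the crawler names are the ones the search engines publicly document; treat it as a sketch, not their actual file):

    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: DuckDuckBot
    Disallow:

    # Everyone else, including GPTBot, is blocked.
    User-agent: *
    Disallow: /

An empty Disallow allows everything for that crawler; the final wildcard group is what shuts out everyone else, including any new search engine that comes along, which is the shame mentioned above.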

The argument about image/content licensing is, I think, distinct from the one about how scrapers should behave. I completely agree that big companies running scrapers should be good citizens — but people hosting content on the web need to do their part, too. Again, without any details on the timing, we have no idea if OpenAI made 100k requests in ten seconds or if they did it over the course of a day.

Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity. (For scale: 100k requests spread evenly over a day works out to barely more than one request per second.)

> Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.

They publicly published the site for their customers to browse, with the side benefit that curious people could also use the site in moderation since it wasn't affecting them in any real way. OpenAI isn't their customer, and their use is affecting them in terms of hosting costs and lost revenue from downtime.

The obvious next step is to gate that data behind a login, and now we (the entire world) all have slightly less information at our fingertips because OpenAI did what they do.

The point is that OpenAI, or anyone doing massive scraping ops, should know better by now. Sure, the small company that doesn't do web design had a single file misconfigured, but that shouldn't be a four- or five-figure mistake. OpenAI knows what bandwidth costs. There should be a mechanism that says: hey, we have asked for many gigabytes or terabytes of data from a single domain in this scrape; that is a problem.
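Purely as a sketch of what that mechanism could look like on the crawler side (the budget figure and names here are mine, not anything OpenAI has described): keep a running byte count per domain and stop fetching once the budget is spent.

    from collections import defaultdict
    from urllib.parse import urlparse

    # Hypothetical budget: stop after ~5 GB fetched from any single domain.
    DOMAIN_BYTE_BUDGET = 5 * 1024**3
    bytes_fetched = defaultdict(int)

    def should_fetch(url: str) -> bool:
        """Skip URLs on domains that have exhausted their crawl budget."""
        return bytes_fetched[urlparse(url).netloc] < DOMAIN_BYTE_BUDGET

    def record_response(url: str, body: bytes) -> None:
        """Charge the response size against the domain's budget."""
        bytes_fetched[urlparse(url).netloc] += len(body)

A dozen lines of bookkeeping like this is cheap insurance against handing a small site a four- or five-figure bandwidth bill.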

> If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content

> The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.

These two statements are at odds, I hope you realize. You say public accessibility of information is a good thing, while blaming someone for being effectively DDoS'd as a result of having said information public.

They're not at odds. "Default-public accessibility of information" doesn't necessarily translate into "default-public accessibility of content", i.e. media. Content should be served behind an authentication layer.

The clickbaity hysteria here misses how this sort of scraping was possible long before AI agents showed up a couple of years back.

Of course it was possible, but the incentives have changed. Now anyone can use the accumulated knowledge of the world to build something new, so more independent actors are doing so, often very badly.

> although the article mentions tens/hundreds of thousands of requests, they don't say anything about them being made unreasonably quickly.

It's the first sentence of the article.

> On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down.

If a scraper is making enough requests to take someone else's website down, the scraper's requests are being made unreasonably quickly.

robots.txt as of right now is a complete honor system, so I think it's reasonable to conclude that you shouldn't rely on it to protect you; the odds are overwhelming that scraping behavior will get worse in the near- to mid-term future.
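Given that, the practical fallback is to enforce limits on the serving side rather than hope crawlers behave. A minimal per-client token-bucket sketch in Python (the rate and burst numbers are arbitrary, not from the article):

    import time
    from collections import defaultdict

    RATE = 2.0    # refill rate: allowed requests per second per client (arbitrary)
    BURST = 10.0  # short bursts tolerated before throttling (arbitrary)

    _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(client_ip: str) -> bool:
        """Token bucket: refill over elapsed time, spend one token per request."""
        b = _buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # over the limit; a real server would answer 429

In practice most sites would reach for something off the shelf (Cloudflare, or nginx's limit_req) rather than hand-roll this, but the principle is the same.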

Let us flip this around: If your crawler regularly knocks websites offline, you've clearly done something wrong.

There's no chance every single website in existence is going to have a flawless setup. That's guaranteed simply from the number of websites, and how old some of them are.

It's less about sympathy and more about understanding that they might not be experts in things tech: they relied on hired help that seemed to be good at what they did, and the most basic thing (setting up a free Cloudflare account or something) was missed.

Learning how is sometimes actually learning who's going to get you online in a good way.

In this case, when you have non-tech people building WordPress sites, it's about what they can understand and do, and the rate of learning doesn't always keep up relative to client work.
