Comment by efilife - Hacker Neue

efilife 5 days ago parent

> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without cloudflare

I'm sorry, what? I can't believe I am reading this on HackerNews. All you have to do is code your own, BASIC captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like reject short useragents). I do this and NEVER had a bot get past this and not a single user ever complained. It's extremely simple, I should ship this and charge people if no one seems to be able to figure this out by themselves.

n1xis10t 4 days ago

Oops you just leaked your own intellectual property

ATechGuy 4 days ago

From ChatGPT:

This approach can stop very basic scripts, but the claim that “99.9999% of scrapers can’t execute JS or handle cookies” isn’t accurate anymore. Modern scraping tools commonly use headless browsers (Playwright, Puppeteer, Selenium), execute JavaScript, support cookies, and spoof realistic user agents. Any scraper beyond the most trivial will pass a JS-set cookie check without effort. That said, using a lightweight JS challenge can be reasonable as one signal among many, especially for low-value content and when minimizing user friction is a priority. It’s just not a reliable standalone defense. If it’s working for you, that likely means your site isn’t a high-value scraping target — not that the technique is fundamentally robust.

efilife OP 4 days ago

From someone who actually does this stuff:

The claim is very accurate. Maybe not for the biggest websites, but very accurate for a self-hosted blog. You are not that important to waste compute power to set up a whole ass headless browser to scrape your page. Why am I even arguing with ChatGPT?

andersmurphy 4 days ago

Yup another trick is to only serve br compressed resources and serve nothing to clients that don't support brotli. A lot of http clients don't support brotli out of the box.

I take it further and only stream content to clients that have a cookie, support js and br. Otherwise all you get is a minimal static pre br compressed shim. Seems to work well enough.

phyzome 4 days ago

There should be a new rule on HN: No posts that just go "I asked an LLM and it said..."

You're not adding anything to the conversation.

cyphar 4 days ago

Yeah, I really have to wonder what the thought process is behind leaving such a comment. When people first started doing it I wondered if it was some kind of guerrilla outrage marketing campaign.

PunchyHamster 4 days ago

There was no thought process

efilife OP 4 days ago

Maybe he wanted to verify whether what I was saying was true and asked ChatGPT, then tried to be helpful by pasting the response here?

cyphar 4 days ago

Maybe I'm getting too jaded but I'm struggling to be quite that charitable.

The entireity of the human-written text in that comment was "From ChatGPT:" and it was formatted as though it was a slam-dunk "you're wrong, the computer says so" (imagine it was "From Wikipedia" followed by a quote disagreeing with you instead).

I'm sure some people do what you describe but then I would expect at least a little bit more explanation as to why they felt the need to paste a paragraph of LLM output into their comment. (While I would still disagree that it is in any way valuable, I would at least understand a bit about what they are trying to communicate.)

phyzome 4 days ago

Yeah, I agree that that's likely the thought process. It just happens to be the opposite of helpful.

6031769 4 days ago

So an LLM says that a technique used to foil LLM scrapers is ineffective against LLM scrapers.

It's almost as if it might have an ulterior motive in saying so.

This item has no comments currently.