
I've built crawlers that retrieve billions of web pages every month. We had a whole team working on the crawlers: adapting to website changes, reverse engineering AJAX requests, and solving hard problems like captchas. Bottom line: if someone wants to crawl your website, they will.

What you can do, however, is make it hard enough that the vast majority of developers can't do it (e.g. my tech crawled billions of pages, but it took a whole team dedicated to keeping it going). If you have money to spend, Distil Networks and Incapsula have good solutions. They block PhantomJS and Selenium-driven browsers, and they rate-limit the bots.

What I found really effective, and some websites do it, is tarpitting bots. That is, slowly increase the number of seconds it takes to return the HTTP response. After a certain number of requests to your site, it takes 30+ seconds for the bot to get the HTML back. The downside is that your web servers need to hold many more open connections, but the benefit is that you'll throttle the bots to an acceptable level.
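
To make the tarpit idea concrete, here is a minimal sketch as a Flask before-request hook (the framework, thresholds, and in-memory counter are my own illustrative choices, not any particular site's actual setup; a real deployment would keep the counter somewhere shared like Redis):

    # Minimal tarpit sketch (illustrative, not any site's actual code).
    import time
    from collections import defaultdict
    from flask import Flask, request

    app = Flask(__name__)

    FREE_REQUESTS = 100   # requests an IP can make before we start slowing it
    DELAY_STEP = 0.5      # extra seconds added for each request past that
    MAX_DELAY = 30.0      # cap, matching the 30+ seconds mentioned above

    hits = defaultdict(int)  # per-IP request count (in-memory for the sketch)

    @app.before_request
    def tarpit():
        ip = request.remote_addr
        hits[ip] += 1
        over = hits[ip] - FREE_REQUESTS
        if over > 0:
            # Each additional request waits a little longer than the last.
            time.sleep(min(over * DELAY_STEP, MAX_DELAY))

Note that the sleep ties up a worker for the whole delay, which is exactly the "more open connections" cost mentioned above; an async server or connection-level throttling handles that more gracefully.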

I currently run a website that gets crawled a lot, deadheat.ca. I've written a simple algorithm that tarpits bots. I also throw a captcha every now and then when I see an IP address hitting too often over a span of a few minutes. The website is not super popular and, in my case, it's pretty simple to differentiate between a human and a bot.
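
Something in the spirit of that captcha check (not the actual deadheat.ca code; the window and threshold below are made-up numbers) could be a per-IP sliding window:

    # Hypothetical captcha trigger: flag an IP that hits too often
    # within a few-minute window.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 5 * 60   # "a span of a few minutes"
    MAX_HITS = 120            # made-up threshold

    recent = defaultdict(deque)  # IP -> timestamps of recent requests

    def needs_captcha(ip):
        now = time.time()
        q = recent[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:  # drop stale timestamps
            q.popleft()
        return len(q) > MAX_HITS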

Hope this helps...


Do you feel bad at all about apparently making a business out of crawlers, but still apparently viewing it as bad enough that you want countermeasures against it? Don't you feel a slight bit hypocritical about this?
I don't think I'm being hypocritical. I have no issue if people crawl my site; I even identify who they are and give them access to a private API. I don't generate any income from that website, though. I provide a service because I love doing it, and I cover all the costs. Bots do increase my costs, so I choose to limit their activity. Crawl me, but do so by my rules.

One of the alternatives is charging for my service, but bots are not my users' problem; they are mine.

> I've built crawlers that retrieve billions of web pages every month.

Wow, what were you doing with the data?

Competitive intelligence.

Crawling thousands of websites, mashing up the data to analyze competitiveness between them, and selling it back.

For example, the cost of flights. Different websites offer different prices for the same flight. The technology crawls all the prices, combines the data, then resells it back to the websites. Everyone knows everyone's prices, which keeps competition high and prices lower for consumers.

Travel companies pay their GDS (global distribution system) for every search they do. It costs so much that it's the primary cost centre for some of them. You were costing them thousands of dollars a day.
If unwanted scraping can be distinguished from legitimate traffic, wouldn't a sort of honeypot strategy work, where you serve fake or divergent data to the requests you've identified as likely unwelcome?
When websites get a ton of traffic, the concern is that the algorithm deciding who gets the fake data won't be accurate and will start hitting paying customers. So it's a fine line between blocking paying customers and serving fake data. What these algorithms do instead of blocking is throw captchas, so if the traffic really is human, the captcha can be solved. The bigger problem is that there's a good chance humans who are thrown a captcha will leave and buy somewhere else (because they're lazy, the captcha is hard, etc.).

Solutions like Cloudflare and Distil have sophisticated algorithms to strike that balance between fake data and real traffic, but even they are not close to perfect.

How do you keep your website functioning over time for legitimate users, though? I'm a sysadmin, not a coder or developer, so the tricks you can do are a little foreign to me. Can you provide examples? Why don't Adidas/Nike/et al. do this to fight the likes of sneaker bots?
Did you respect robots.txt?
It sounds to me like an obvious no, if they have a large team to get around countermeasures.
Considering the effort that went into it.

I am pretty sure crawling robots.txt links was their P1 requirement.

+1 for Incapsula or Cloudflare.

BTW, I'm interested in learning a bit more about your stack; we're on the same route but at a smaller scale.

Very curious about this type of work. Is there a good way to contact you to discuss this topic?
How do you bypass Google reCAPTCHA?
I can't provide details on any innovations we've done with sites like Google, but in general, if you want to crawl Google you'll want to get "many, many" IP addresses. I've heard of people using services like 2captcha.com, but the best way is to obfuscate who you are.

If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1,000 times per minute, you need about 17 IPs in rotation. Randomize headers to look like real people coming from schools, office buildings, etc. Lots of work, but possible.
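
The sizing is just division; a quick sketch of the arithmetic plus the kind of header randomization being described (all header values below are invented examples):

    import math
    import random

    needed_per_minute = 1000          # crawl rate you want
    allowed_per_ip_per_minute = 60    # what one IP can sustain before blocks

    ips_needed = math.ceil(needed_per_minute / allowed_per_ip_per_minute)
    print(ips_needed)  # 17 IPs in rotation

    # Randomize the request fingerprint so traffic looks like many
    # unrelated users rather than one bot (values are placeholders).
    headers = {
        "User-Agent": random.choice([
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        ]),
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"]),
    }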

I do it using rotating proxies, stripping cookies between requests, randomly varying the delay between requests, randomly selecting a valid user-agent string, etc. It's a pain in the butt. And to scrape more than I do, faster than I do, would be pretty freaking expensive in terms of time and money.
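
Roughly how that rotation might look with the `requests` library (the proxy addresses, user agents, and delay range below are placeholders, not the commenter's actual values):

    import random
    import time
    import requests

    PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]   # placeholders
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def fetch(url):
        proxy = random.choice(PROXIES)
        resp = requests.get(
            url,
            headers={"User-Agent": random.choice(USER_AGENTS)},
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        # No shared Session object, so cookies aren't carried between requests.
        time.sleep(random.uniform(2, 10))  # randomized delay before the next request
        return resp.text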

Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.

If Google gets more serious about blocking me, then I'll use ML to overcome their ML (which should be doable because they're always worried about keeping Search consumer-friendly).

If you do go the ML route, I recommend TensorFlow + Google Cloud (both for the cost/performance and the irony).
There are services that do this with humans for pennies. (A service I've used charges $2/1000)
Mechanical Turk
