One of the alternatives is charging for my service, but bots aren't my users' problem, they're mine.
Wow, what were you doing with the data?
Crawling thousands of websites, mashing up the data to analyze competitiveness between them, and selling it back.
For example, the cost of flights. Different websites offer different prices for the same flight. The technology crawls all the prices, combines the data, then resells it back to the websites. Everyone knows everyone's prices, which keeps competition high and prices lower for consumers.
Solutions like Cloudflare and Distil have sophisticated algorithms for separating fake traffic from real, but even they are not close to perfect.
BTW, I'm interested in learning a bit more about your stack; we're on the same route but at a smaller scale.
If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1000 times per minute, you need roughly 17 IPs in rotation (1000 / 60 ≈ 17). Randomize headers so the traffic looks like real people coming from schools, office buildings, etc... Lots of work but possible.
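Purely as an illustration of that kind of rotation, here's a rough Python sketch. The proxy URLs and User-Agent strings are placeholders I made up, not anything from the parent's setup:

```python
import random
import time
import requests

# Hypothetical pools -- swap in your own proxies and realistic User-Agent strings.
PROXIES = [f"http://proxy{i}.example.com:8080" for i in range(17)]  # ~1000 req/min at 60 req/min per IP
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url: str) -> requests.Response:
    """Send one request through a random proxy with randomized headers."""
    proxy = random.choice(PROXIES)
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8", "fr-CA,fr;q=0.7"]),
        "Referer": "https://www.google.com/",
    }
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)

if __name__ == "__main__":
    resp = fetch("https://www.google.com/search?q=flights")
    print(resp.status_code)
    time.sleep(random.uniform(0.5, 2.0))  # jitter between requests so traffic isn't machine-regular
```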
Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.
If Google gets more serious about blocking me, then I'll use ML to overcome their ML (which should be doable because they're always worried about keeping Search consumer-friendly).
What you can do, however, is make it hard enough that the vast majority of developers can't do it (e.g. my tech crawled billions of pages, but there was a whole team dedicated to keeping it going). If you have money to spend, Distil Networks and Incapsula have good solutions. They block PhantomJS and browsers driven by Selenium, and they rate-limit the bots.
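Those products do far more sophisticated fingerprinting than this, but as a toy illustration of the cheapest version of the idea, a server can at least reject the obvious automation User-Agents before doing anything smarter. A hypothetical Flask sketch, not how Distil or Incapsula actually work:

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Crude markers that show up in the User-Agent of common automation setups.
# Selenium driving a real Chrome won't reveal itself here; commercial products
# rely on client-side fingerprinting (e.g. checking navigator.webdriver) instead.
HEADLESS_MARKERS = ("phantomjs", "headlesschrome", "python-requests", "scrapy")

@app.before_request
def reject_obvious_bots():
    ua = (request.headers.get("User-Agent") or "").lower()
    if any(marker in ua for marker in HEADLESS_MARKERS):
        abort(403)

@app.route("/")
def index():
    return "hello"
```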
What I found really effective on some websites is tarpitting bots. That is, slowly increase the number of seconds it takes to return the HTTP response, so after a certain number of requests to your site it takes 30+ seconds for the bot to get the HTML back. The downside is that your web servers need to accept many more incoming connections, but the benefit is you'll throttle the bots to an acceptable level.
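A minimal sketch of a tarpit, assuming a Flask app and a simple per-IP counter (the thresholds are made-up numbers, not anyone's real tuning):

```python
import time
from collections import defaultdict
from flask import Flask, request

app = Flask(__name__)

request_counts = defaultdict(int)   # per-IP request counter (reset it periodically in real use)
FREE_REQUESTS = 100                 # requests allowed before the tarpit kicks in
MAX_DELAY = 30                      # cap the added delay at 30 seconds

@app.before_request
def tarpit():
    ip = request.remote_addr
    request_counts[ip] += 1
    over = request_counts[ip] - FREE_REQUESTS
    if over > 0:
        # Each extra request adds a bit more delay, up to the cap.
        time.sleep(min(over * 0.5, MAX_DELAY))

@app.route("/")
def index():
    return "hello"
```

Note that `time.sleep` ties up a worker for the whole delay, which is exactly the "more incoming connections" cost mentioned above; an async server handles the held-open connections more gracefully.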
I currently run a website that gets crawled a lot, deadheat.ca. I've written a simple algorithm that tarpits bots, and I also throw a captcha every now and then when I see an IP address hitting too often over a span of a few minutes. The website isn't super popular and, in my case, it's pretty simple to differentiate between a human and a bot.
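Roughly what that kind of check could look like; the window and threshold here are guesses, not what deadheat.ca actually uses:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 180              # "a span of a few minutes"
HIT_THRESHOLD = 120               # hits within the window before demanding a captcha
recent_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def needs_captcha(ip: str) -> bool:
    """Return True once an IP has hit the site too often in the last few minutes."""
    now = time.time()
    window = recent_hits[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > HIT_THRESHOLD
```

In a request handler you'd call `needs_captcha(request.remote_addr)` and serve a captcha page instead of the real content when it returns True.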
Hope this helps...