The infra for a decent crawl is prohibitively expensive. There's a bit of black magic in crawl scheduling, and a bit in de-duplication, but most of the challenge is in scale.
I used to work on Google's indexing system, and sat with the guys who wrote the Percolator system, which basically used BigTable triggers to drive indexing and make it less batch-oriented.
I know France has made at least a couple of attempts at a government-funded "Google killer" search engine. I think it would be a better use of government money to build a government-run, event-driven, first-level indexing system, where search engine companies pay roughly cloud-computing costs to have their proprietary triggers populate their own databases from the shared crawl and first-level analysis. When a page updates, you want all of the search engine startups running their triggers against the same copy of the data, rather than streaming that data out to each startup separately.
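To make the trigger idea concrete, here's a rough Python sketch of what the startup-facing side might look like. Everything in it (the AnalysisPackage fields, register_trigger, the client class) is hypothetical; it's just meant to show the shape of the API, with each startup's callback running against the shared copy of the data.

    # Rough sketch of the startup-facing side; every name here is made up.
    from dataclasses import dataclass

    @dataclass
    class AnalysisPackage:
        canonical_url: str
        plain_text: str        # normalized UTF-8 text
        annotations: bytes     # hypothetical binary markup annotations
        link_scores: dict      # e.g. {"pagerank": 0.0013, "age_days": 42}

    class SharedIndexClient:
        """Client for the (hypothetical) shared, event-driven first-level index."""

        def __init__(self):
            self._triggers = []   # callbacks registered by a search startup

        def register_trigger(self, callback):
            # In the real system this would install a server-side observer,
            # Percolator-style, billed at roughly cloud-computing cost.
            self._triggers.append(callback)

        def _on_change(self, package):
            # Fired whenever the shared crawl detects a change to a canonical URL;
            # every startup's trigger runs against the same copy of the data.
            for callback in self._triggers:
                callback(package)

    def my_startup_indexer(package):
        # Populate the startup's own proprietary database from the shared package.
        print(f"re-indexing {package.canonical_url}: {len(package.plain_text)} chars")

    client = SharedIndexClient()
    client.register_trigger(my_startup_indexer)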
Basically, you take some importance metric and some estimate of the probability that the content has changed since you last crawled it, and combine the product of the two with some additional constraints (re-crawl every known page within some maximum period, don't hit any one domain too hard, etc.) into a crawl priority.

You then crawl the content and convert HTML, PDF, etc. into some marked-up text format (UTF-8 HTML isn't bad, but I think UTF-8 plain text plus separate annotations in a binary format would be better), stripping out text that's too small or too close to the background color.

Next you calculate one or more locality-sensitive hash functions over the plain text, cluster similar texts, and pick a canonical URL for each cluster. You calculate the directed link graph across clusters; the PageRank patent has expired, so you could compute PageRank and several other link-graph ranking signals across the canonical clusters. You'd presumably also compute uniqueness scores, age scores, etc. for each canonical URL, and then run each search engine startup's analysis, in parallel, over this package of analysis data each time you detect a change for a particular canonical URL.
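For the scheduling step, a minimal sketch of that priority computation might look like the following; the importance metric, change probability, max re-crawl age, and per-domain politeness limit are all placeholders for whatever a real system would use.

    import heapq, time
    from urllib.parse import urlparse

    MAX_AGE_SECONDS = 30 * 24 * 3600   # placeholder: re-crawl everything at least monthly

    def crawl_priority(importance, p_changed, last_crawled, now=None):
        """Higher = crawl sooner. importance and p_changed are both in [0, 1]."""
        now = now or time.time()
        score = importance * p_changed
        if now - last_crawled > MAX_AGE_SECONDS:
            score = max(score, 1.0)    # force a re-crawl once a page is past the max age
        return score

    def next_batch(pages, per_domain_limit=2):
        """pages: iterable of (url, importance, p_changed, last_crawled) tuples."""
        heap = [(-crawl_priority(imp, p, t), url) for url, imp, p, t in pages]
        heapq.heapify(heap)
        taken = {}   # domain -> URLs already taken this batch (crude politeness constraint)
        batch = []
        while heap:
            _, url = heapq.heappop(heap)
            domain = urlparse(url).netloc
            if taken.get(domain, 0) >= per_domain_limit:
                continue               # a real scheduler would requeue for a later batch
            taken[domain] = taken.get(domain, 0) + 1
            batch.append(url)
        return batch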
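For the de-duplication step, one well-known locality-sensitive hash is SimHash. A toy version over whitespace tokens (a real system would use shingles and a better hash) looks roughly like this:

    import hashlib

    def simhash(text, bits=64):
        """Toy SimHash: near-duplicate texts get hashes with small Hamming distance."""
        v = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Cluster pages whose hashes are within a few bits of each other,
    # then pick one canonical URL per cluster (e.g. the highest-importance one).
    def near_duplicates(hash_a, hash_b, threshold=3):
        return hamming(hash_a, hash_b) <= threshold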
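And since the PageRank patent has expired, the link-graph part could literally be the textbook power iteration over the cluster-level graph. A minimal version, skipping the dangling-node handling and convergence checks you'd want in practice:

    def pagerank(links, damping=0.85, iterations=20):
        """links: dict mapping each canonical URL to a list of canonical URLs it links to."""
        nodes = set(links) | {u for targets in links.values() for u in targets}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for src, targets in links.items():
                if targets:
                    share = damping * rank[src] / len(targets)
                    for dst in targets:
                        new_rank[dst] += share
            rank = new_rank
        return rank

    # e.g. pagerank({"a.com/x": ["b.com/y"], "b.com/y": ["a.com/x", "c.com/z"]})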
You might have some startups doing spam scoring or other analysis and selling that (for fees, of course) to the search engine startups, etc. Basically, you want to modularize the indexing and analysis to enable competition and a nearly seamless transition between competing providers within the ecosystem.
I think that's the way to drive innovation in the search engine startup space while properly leveraging economies of scale across those startups.