The infra for a decent crawl is prohibitively expensive. There's a bit of black magic in crawl scheduling, and a bit in de-duplication, but most of the challenge is in scale.
I used to work on Google's indexing system, and sat with the guys who wrote the Percolator system, which basically used BigTable triggers to drive indexing and make it less batch-oriented.
I know France has made at least a couple of attempts at a government-funded "Google killer" search engine. I think it would be a better use of government money to build a government-run, event-driven, first-level indexing system, where search engine companies pay roughly cloud-computing costs to have their proprietary triggers populate their own databases from the shared crawl and first-level analysis. When a page updates, you want all of the search engine startups running their triggers against the same copy of the data, rather than streaming that data out to each startup separately.
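To make the trigger idea concrete, here's a rough Python sketch of what the startup-facing side might look like. Everything in it (the AnalysisPackage fields, register_trigger, the client class) is hypothetical; it's just meant to show the shape of the API, with each startup's callback running against the shared copy of the data.

    # Rough sketch of the startup-facing side; every name here is made up.
    from dataclasses import dataclass

    @dataclass
    class AnalysisPackage:
        canonical_url: str
        plain_text: str        # normalized UTF-8 text
        annotations: bytes     # hypothetical binary markup annotations
        link_scores: dict      # e.g. {"pagerank": 0.0013, "age_days": 42}

    class SharedIndexClient:
        """Client for the (hypothetical) shared, event-driven first-level index."""

        def __init__(self):
            self._triggers = []   # callbacks registered by a search startup

        def register_trigger(self, callback):
            # In the real system this would install a server-side observer,
            # Percolator-style, billed at roughly cloud-computing cost.
            self._triggers.append(callback)

        def _on_change(self, package):
            # Fired whenever the shared crawl detects a change to a canonical URL;
            # every startup's trigger runs against the same copy of the data.
            for callback in self._triggers:
                callback(package)

    def my_startup_indexer(package):
        # Populate the startup's own proprietary database from the shared package.
        print(f"re-indexing {package.canonical_url}: {len(package.plain_text)} chars")

    client = SharedIndexClient()
    client.register_trigger(my_startup_indexer)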
Basically, you take some importance metric and some estimate of the probability that the content has changed since you last crawled it, and combine the product of the two with some additional constraints (re-crawl every known page within some maximum period, don't hit any one domain too hard, etc.) into a crawl priority.

You then crawl the content and convert HTML, PDF, etc. into some marked-up text format (UTF-8 HTML isn't bad, but I think UTF-8 plain text plus separate annotations in a binary format would be better), stripping out text that's too small or too close to the background color.

Next you calculate one or more locality-sensitive hash functions over the plain text, cluster similar texts, and pick a canonical URL for each cluster. You calculate the directed link graph across clusters; the PageRank patent has expired, so you could compute PageRank and several other link-graph ranking signals across the canonical clusters. You'd presumably also compute uniqueness scores, age scores, etc. for each canonical URL, and then run each search engine startup's analysis, in parallel, over this package of analysis data each time you detect a change for a particular canonical URL.
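For the scheduling step, a minimal sketch of that priority computation might look like the following; the importance metric, change probability, max re-crawl age, and per-domain politeness limit are all placeholders for whatever a real system would use.

    import heapq, time
    from urllib.parse import urlparse

    MAX_AGE_SECONDS = 30 * 24 * 3600   # placeholder: re-crawl everything at least monthly

    def crawl_priority(importance, p_changed, last_crawled, now=None):
        """Higher = crawl sooner. importance and p_changed are both in [0, 1]."""
        now = now or time.time()
        score = importance * p_changed
        if now - last_crawled > MAX_AGE_SECONDS:
            score = max(score, 1.0)    # force a re-crawl once a page is past the max age
        return score

    def next_batch(pages, per_domain_limit=2):
        """pages: iterable of (url, importance, p_changed, last_crawled) tuples."""
        heap = [(-crawl_priority(imp, p, t), url) for url, imp, p, t in pages]
        heapq.heapify(heap)
        taken = {}   # domain -> URLs already taken this batch (crude politeness constraint)
        batch = []
        while heap:
            _, url = heapq.heappop(heap)
            domain = urlparse(url).netloc
            if taken.get(domain, 0) >= per_domain_limit:
                continue               # a real scheduler would requeue for a later batch
            taken[domain] = taken.get(domain, 0) + 1
            batch.append(url)
        return batch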
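For the de-duplication step, one well-known locality-sensitive hash is SimHash. A toy version over whitespace tokens (a real system would use shingles and a better hash) looks roughly like this:

    import hashlib

    def simhash(text, bits=64):
        """Toy SimHash: near-duplicate texts get hashes with small Hamming distance."""
        v = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Cluster pages whose hashes are within a few bits of each other,
    # then pick one canonical URL per cluster (e.g. the highest-importance one).
    def near_duplicates(hash_a, hash_b, threshold=3):
        return hamming(hash_a, hash_b) <= threshold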
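And since the PageRank patent has expired, the link-graph part could literally be the textbook power iteration over the cluster-level graph. A minimal version, skipping the dangling-node handling and convergence checks you'd want in practice:

    def pagerank(links, damping=0.85, iterations=20):
        """links: dict mapping each canonical URL to a list of canonical URLs it links to."""
        nodes = set(links) | {u for targets in links.values() for u in targets}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for src, targets in links.items():
                if targets:
                    share = damping * rank[src] / len(targets)
                    for dst in targets:
                        new_rank[dst] += share
            rank = new_rank
        return rank

    # e.g. pagerank({"a.com/x": ["b.com/y"], "b.com/y": ["a.com/x", "c.com/z"]})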
You might have some startups doing spam scoring or other analysis and selling that (for fees, of course) to the search engine startups, etc. Basically, you want to modularize the indexing and analysis to enable competition and a nearly seamless transition between competing providers within the ecosystem.
I think that's the way to drive innovation in the search engine startup space while properly leveraging economies of scale across those startups.