I had been googling Google's alternatives, and here is what I found.

Google provides a great service with its web search engine, and the scale of the task is huge.

If you want to go the open source way, here are some interesting projects and organizations that I came across:

  • - Peer to peer index, you can create your own index
  • - Open crawl data for the web. You get all the URLs
  • - Chrome seems to have made a lot of data public
  • - Searx is a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.
  • - Proposal for an open web index

All of these options are good for different use cases.

As I understand it, there are two parts: a crawling part, where you identify URLs, and an indexing part, where you analyze the content.
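
To make the two parts concrete, here is a minimal sketch using only the Python standard library. It is not how any of the projects above actually work; the names `PageParser`, `crawl`, `build_index`, and the in-memory `FAKE_WEB` pages are all made up for illustration, and a real crawler would also need to fetch over the network, respect robots.txt, and handle errors.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class PageParser(HTMLParser):
    """Collects outgoing links and text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Resolve every <a href="..."> against the page's own URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(start_url, fetch, max_pages=10):
    """Crawling part: follow links breadth-first, return {url: text}."""
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)  # fetch is any callable returning HTML or None
        if html is None:
            continue
        parser = PageParser(url)
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        pages[url] = " ".join(parser.text_parts)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Indexing part: map each lowercase word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> home page',
    "http://example.test/a": "open source search engine",
    "http://example.test/b": '<a href="/">back</a> open crawl data',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
index = build_index(pages)
print(sorted(index["open"]))
# prints ['http://example.test/a', 'http://example.test/b']
```

Passing the fetch function in as an argument keeps the crawl logic separate from the network, so the same loop could be pointed at real pages later by swapping `FAKE_WEB.get` for a function that downloads a URL.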

There are many open source ways to do the crawling, but I decided to give Scrapy a try. It's a Python library.