I have been googling alternatives to Google, and here is what I found.
Google provides a great service in its web search engine, and the scale of that task is huge.
If you want to go the open source way, here are some interesting projects and organizations that I came across:
- https://yacy.net/ - Peer-to-peer search engine; you can create your own index
- https://commoncrawl.org/ - Open crawl data for the web. You get the crawled pages along with an index of their URLs
- https://developers.google.com/web/tools/chrome-user-experience-report/ - The Chrome User Experience Report, a public dataset of real-world user experience metrics collected from Chrome users
- https://searx.github.io/searx/ - Searx is a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.
- https://openwebindex.eu/ - A proposal for an open web index
All of these options are good for different use cases.
As I understand it, a search engine has two parts: a crawling part, where you discover URLs, and an indexing part, where you analyze the content of the pages you found.
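To make the two parts concrete, here is a toy sketch in Python. It is only an illustration of the idea, not how any real engine works: the `PAGES` dictionary is a made-up in-memory "web", and `crawl` and `build_index` are names I picked for this example.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("open source search engines", ["a.example/crawl", "a.example/index"]),
    "a.example/crawl": ("a crawler discovers urls", ["a.example/"]),
    "a.example/index": ("an index maps words to urls", []),
}

def crawl(seed):
    """Crawling part: breadth-first walk that visits each reachable URL once."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in PAGES.get(url, ("", []))[1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def build_index(urls):
    """Indexing part: inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url in urls:
        text = PAGES[url][0]
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

urls = crawl("a.example/")
index = build_index(urls)
print(sorted(index["urls"]))  # pages that mention the word "urls"
```

Looking up a word in `index` is then the "search" step. A real system would also need ranking, politeness rules and persistent storage, which this sketch leaves out.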
There are many open source ways to do the crawling, but I decided to give Scrapy a try. It's a Python library: https://docs.scrapy.org/en/latest/intro/tutorial.html