search-engine

I have been googling for alternatives to Google, and here is what I found.

Google provides a great web search service, and the scale of that task is huge.

If you want to go the open source route, here are some interesting projects and organizations that I came across:

  • https://yacy.net/ - Peer to peer index, you can create your own index
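  • https://commoncrawl.org/ - Open crawl data for the web. You get the URLs and page content it has crawled
  • https://developers.google.com/web/tools/chrome-user-experience-report/ - Chrome User Experience Report. Chrome seems to have made a lot of real-world usage data public
  • https://almanac.httparchive.org/en/2021/methodology#websites - HTTP Archive Web Almanac, methodology section on the websites it covers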
  • https://searx.github.io/searx/ - Searx is a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.
  • https://openwebindex.eu/ - Proposal for an open web index
  • https://searchmysite.net/search/browse/ - Search engine for personal and independent websites
  • https://indieseek.xyz/links/ - Human-edited directory of indie web links

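Each of these options suits a different use case: some give you raw crawl data, some a ready-made index or directory, and some a metasearch front end over existing engines.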

As I understand it, a search engine has two main parts: a crawling part, where you discover URLs, and an indexing part, where you analyze the content of those pages so it can be searched.
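To make the crawl/index split concrete, here is a minimal sketch using only the Python standard library. It is not how any of the projects above work internally. The PAGES dictionary, the example URLs, and the helper names (PageParser, crawl, build_index) are all made up for illustration, and a real crawler would fetch pages over HTTP instead of reading them from memory.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. Stands in for real HTTP fetches.
PAGES = {
    "https://a.example/": '<a href="https://b.example/">b</a> open search index',
    "https://b.example/": '<a href="https://a.example/">a</a> peer to peer search',
}


class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def crawl(seed):
    """Crawling part: breadth-first discovery of URLs starting from a seed."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(PAGES.get(url, ""))
        docs[url] = " ".join(parser.text)
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_index(docs):
    """Indexing part: inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


docs = crawl("https://a.example/")
index = build_index(docs)
print(sorted(index["search"]))
```

Looking up a word in the index returns the set of pages that mention it. Ranking those pages is a separate problem that this sketch does not touch.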

There are many open source tools for crawling, but I decided to give Scrapy a try. It's a Python library: https://docs.scrapy.org/en/latest/intro/tutorial.html
