- https://yacy.net/ - Peer-to-peer search index; you can also create your own index
- https://commoncrawl.org/ - Open crawl data for the web, including the URLs that were crawled
- https://developers.google.com/web/tools/chrome-user-experience-report/ - Chrome User Experience Report, a public dataset of real-user experience metrics collected by Chrome
- https://almanac.httparchive.org/en/2021/methodology#websites
- https://searx.github.io/searx/ - Searx is a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.
- https://openwebindex.eu/ - Proposal for an open web index
- https://searchmysite.net/search/browse/
- https://indieseek.xyz/links/
All of these options are good for different use cases.
As I understand it, there are two parts: crawling, where you identify URLs, and indexing, where you analyze the content.
For crawling, there are many open source options, but I decided to give Scrapy a try. It's a Python library: https://docs.scrapy.org/en/latest/intro/tutorial.html
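
To make the crawl-then-index split concrete, here is a minimal sketch using only the Python standard library. It is not Scrapy and not how any of the projects above work. The names `PageParser`, `crawl`, `build_index`, and the `FAKE_SITE` pages under `example.test` are all hypothetical, and the `fetch` argument is a stand-in for a real HTTP client so the example runs without network access.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects outgoing links and text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(start_url, fetch):
    """Crawling part: follow links breadth-first, return {url: text}."""
    seen, queue, pages = {start_url}, [start_url], {}
    while queue:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        parser = PageParser(url)
        parser.feed(html)
        parser.close()  # flush any buffered text
        pages[url] = " ".join(parser.text_parts)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Indexing part: map each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical two-page site, used instead of real HTTP requests.
FAKE_SITE = {
    "https://example.test/": '<a href="/about">About</a> Welcome to my blog',
    "https://example.test/about": "About this blog and its author",
}

pages = crawl("https://example.test/", FAKE_SITE.get)
index = build_index(pages)
print(sorted(index["blog"]))
```

A real crawler would also need politeness rules (robots.txt, rate limits) and error handling, which is a large part of what a library like Scrapy provides.
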
### Below are personal blog websites that I found interesting. I am not sure if they are still active. Thanks: https://blot.im/help
===
<p><strike><a href="http://900dpi.com/" rel="nofollow">900dpi</a></strike>, <strike><a href="https://amb-1.com" rel="nofollow">Amb-1</a></strike>, <a href="https://asocialfolder.com/" rel="nofollow">asocialfolder</a>, <strike><a href="https://www.boxfolio.com/" rel="nofollow">Boxfolio</a></strike>, <strike><a href="http://brace.io/" rel="nofollow">Brace</a></strike>, <strike><a href="http://calepin.co/" rel="nofollow">Calepin</a></strike>, <a href="http://cloudcannon.com/" rel="nofollow">Cloud Cannon</a>, <a href="http://droppages.com/" rel="nofollow">Droppages</a>, <strike><a href="http://dropplets.com/" rel="nofollow">Dropplets</a></strike>, <a href="http://duet.to/" rel="nofollow">Duetto</a>, <strike><a href="http://fargo.io/" rel="nofollow">Fargo</a></strike>, <strike><a href="https://www.harp.io/" rel="nofollow">Harp</a></strike>, <a href="https://www.kissr.com/" rel="nofollow">Kissr</a>, <a href="https://montaigne.io" rel="nofollow">Montaigne</a>, <strike><a href="http://markbox.io/" rel="nofollow">Markbox</a></strike>, <strike><a href="https://www.pancake.io/" rel="nofollow">Pancake</a></strike>, <strike><a href="http://scriptogr.am/" rel="nofollow">Scriptogram</a></strike>, <a href="http://www.site44.com/" rel="nofollow">Site44</a>, <strike><a href="https://www.sitebox.io/" rel="nofollow">Sitebox</a></strike>, <strike><a href="http://skrivr.com/" rel="nofollow">Skrivr</a></strike>, <strike><a href="https://www.smallvictori.es/" rel="nofollow">Small Victories</a></strike>, <strike><a href="https://www.synkee.com/" rel="nofollow">Synkee</a></strike>, <strike><a href="http://updog.co/" rel="nofollow">Updog</a></strike></p>