Insights about the latest Common Crawl dataset.
The latest Common Crawl dataset is a large archive of web pages that researchers and developers can use for applications ranging from training machine learning models to large-scale web analysis.
---
Top Domains by Number of Pages in the Dataset
The table below shows the first 20 of the top 100 domains, ranked by the number of pages each contributes to the latest Common Crawl dataset:
| Rank | Pages | Domain |
|---|---|---|
| 1 | 17668631 | blogspot.com |
| 2 | 4367122 | wikipedia.org |
| 3 | 1858874 | wordpress.org |
| 4 | 1700207 | ebay.com |
| 5 | 1480496 | europa.eu |
| 6 | 1323520 | app.link |
| 7 | 1189629 | google.com |
| 8 | 1155701 | wiktionary.org |
| 9 | 1149496 | ning.com |
| 10 | 1140009 | investing.com |
| 11 | 1139779 | rakuten.co.jp |
| 12 | 1071707 | fandom.com |
| 13 | 1059788 | exblog.jp |
| 14 | 950627 | medium.com |
| 15 | 911901 | ox.ac.uk |
| 16 | 874891 | aif.ru |
| 17 | 856566 | pixnet.net |
| 18 | 823272 | hh.ru |
| 19 | 810382 | googlesource.com |
| 20 | 790915 | qq.com |
(Table truncated for brevity; the full list continues through #100.)
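
To put the page counts above in perspective, each can be expressed as a share of the whole crawl. The following is a minimal sketch, not part of the original dataset tooling: it hard-codes the first five rows of the table and the total page count reported in this document, and the names `TOTAL_PAGES`, `top_domains`, and `share_of_crawl` are hypothetical, chosen only for illustration.

```python
# Total page count as reported in this document.
TOTAL_PAGES = 2_616_796_857

# First five rows of the table above: (domain, pages).
top_domains = [
    ("blogspot.com", 17_668_631),
    ("wikipedia.org", 4_367_122),
    ("wordpress.org", 1_858_874),
    ("ebay.com", 1_700_207),
    ("europa.eu", 1_480_496),
]


def share_of_crawl(pages, total=TOTAL_PAGES):
    """Return the percentage of the crawl that `pages` represents."""
    return 100.0 * pages / total


# Print one row per domain: name, page count, and share of the crawl.
for domain, pages in top_domains:
    print(f"{domain:<15} {pages:>12,} {share_of_crawl(pages):6.3f}%")
```

Even the largest contributor accounts for well under 1% of all pages, which suggests the crawl is spread across a very large number of domains.
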
---
Total Pages in the Crawl
2,616,796,857
In words, 2,616,796,857 is two billion, six hundred sixteen million, seven hundred ninety-six thousand, eight hundred fifty-seven.
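
The spelled-out form can be reproduced by splitting the number into groups of three digits and naming each group. The sketch below is one way to do this under simple assumptions (non-negative integers below one quadrillion, American-style naming without "and"); the function names `three_digits_to_words` and `number_to_words` are hypothetical and not taken from any library.

```python
ONES = [
    "", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = [
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety",
]
SCALES = ["", "thousand", "million", "billion", "trillion"]


def three_digits_to_words(n):
    """Spell out an integer 0 <= n < 1000; returns '' for 0."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens] + (f"-{ONES[ones]}" if ones else ""))
    elif rest:
        parts.append(ONES[rest])
    return " ".join(parts)


def number_to_words(n):
    """Spell out a non-negative integer below 10**15 in English."""
    if n < 0 or n >= 1000 ** len(SCALES):
        raise ValueError("number out of supported range")
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    # Peel off three digits at a time, starting from the least significant.
    while n:
        n, chunk = divmod(n, 1000)
        if chunk:
            words = three_digits_to_words(chunk)
            groups.append(f"{words} {SCALES[scale]}".strip())
        scale += 1
    # Groups were collected smallest first, so reverse before joining.
    return ", ".join(reversed(groups))


print(number_to_words(2_616_796_857))
```
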