segments.paths.gz
This file lists all the segments of a particular Common Crawl. Each line in the file provides the path to a segment directory within the crawl archive on AWS S3 or via HTTP(S).
This file is primarily used to get a list of all segments available in a crawl. You can iterate through the lines and append the base URL (s3://commoncrawl/ or https://data.commoncrawl.org/) to access the segment directories.
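For instance, a minimal Python sketch, assuming a copy of this paths file has already been downloaded into the working directory, that turns each listed path into a full HTTPS URL:

```python
import gzip

# Base URL for HTTPS access to the crawl archive.
BASE_URL = "https://data.commoncrawl.org/"

# Assumes the segments paths file was downloaded locally beforehand.
with gzip.open("segments.paths.gz", "rt") as f:
    segment_urls = [BASE_URL + line.strip() for line in f if line.strip()]

print(f"{len(segment_urls)} segments, e.g. {segment_urls[0]}")
```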
warc.paths.gz
This file contains a list of all the raw web crawl data files (WARC files) for a specific Common Crawl. Each line provides the path to a WARC file. WARC files store the raw HTTP requests and responses, along with metadata.
These files can be processed using Python libraries like warcio or FastWARC.[11] You can download specific WARC files using their paths with tools like wget or the AWS CLI.[6, 12] For large-scale processing, consider using distributed computing frameworks like Spark with libraries such as cc-pyspark.[10, 13]
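A rough sketch, assuming the requests and warcio packages are installed and warc.paths.gz has been downloaded locally, that streams one WARC file over HTTPS and prints the target URI of every response record:

```python
import gzip

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"

# Take the first WARC path listed in a locally downloaded warc.paths.gz.
with gzip.open("warc.paths.gz", "rt") as f:
    warc_path = f.readline().strip()

# Stream the gzipped WARC file; ArchiveIterator decompresses on the fly.
with requests.get(BASE + warc_path, stream=True) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```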
wat.paths.gz
This file lists all the WAT (Web Archive Transformation) files for a given crawl. WAT files contain computed metadata about the crawled web pages, including HTTP response headers and the links found on each page.[7, 8, 1, 2, 3, 4, 5, 9]
WAT files are typically processed to analyze the structure of the web, perform link analysis, or conduct metadata-based research. You can use Python to download and parse these files, often in conjunction with WARC files.[9, 10, 14, 15, 16] Libraries like warcio can handle WAT files as they are also in WARC format.[7, 8]
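A sketch along the same lines, assuming requests and warcio plus a local wat.paths.gz, and relying on the commonly documented WAT JSON envelope layout (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links), that prints each page's URI and outgoing-link count:

```python
import gzip
import json

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"

with gzip.open("wat.paths.gz", "rt") as f:
    wat_path = f.readline().strip()

# WAT files are themselves WARC files; the metadata payload is JSON.
with requests.get(BASE + wat_path, stream=True) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "metadata":
            continue
        data = json.loads(record.content_stream().read())
        envelope = data.get("Envelope", {})
        uri = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
        # Link lists are only present for metadata derived from HTML responses.
        links = (envelope.get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("HTML-Metadata", {})
                         .get("Links", []))
        if uri:
            print(uri, len(links))
```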
wet.paths.gz
This file contains a list of all the WET (WARC Encapsulated Text) files for a crawl. WET files contain the plain text extracted from the content of the crawled web pages.[7, 8, 1, 2, 3, 4, 5, 9]
WET files are useful for text analysis and natural language processing (NLP) tasks. You can download and process these files with Python for linguistic analysis, content categorization, and other text-focused work.[10, 17, 18, 19, 20] Libraries like warcio can read them directly; if you run into gzip handling issues, Python's built-in gzip module can decompress the files first.[17]
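A comparable sketch, again assuming requests and warcio and a local wet.paths.gz, that prints a rough word count for each extracted-text record:

```python
import gzip

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"

with gzip.open("wet.paths.gz", "rt") as f:
    wet_path = f.readline().strip()

# In WET files the extracted plain text lives in "conversion" records.
with requests.get(BASE + wet_path, stream=True) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "conversion":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(uri, len(text.split()))  # page URI and a rough word count
```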
robotstxt.paths.gz
This file lists the WARC files that contain the robots.txt responses fetched during the crawl. robots.txt files are used by websites to tell automated crawlers which parts of a site should not be accessed.[1, 2, 3, 4, 5, 21]
These files can be processed to understand website crawling policies. You can download them using their paths and parse the captured robots.txt content to identify which URLs are allowed or disallowed for web crawlers.[21, 22, 23, 24, 25, 26]
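A minimal sketch, assuming requests and warcio, a local robotstxt.paths.gz, and Python's standard urllib.robotparser, that parses each captured robots.txt and checks whether a generic crawler may fetch the site root:

```python
import gzip
from urllib import robotparser

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"

with gzip.open("robotstxt.paths.gz", "rt") as f:
    robots_warc_path = f.readline().strip()

# Each response record's payload is the body of a fetched robots.txt file.
with requests.get(BASE + robots_warc_path, stream=True) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        body = record.content_stream().read().decode("utf-8", errors="replace")
        rules = robotparser.RobotFileParser()
        rules.parse(body.splitlines())
        # Check whether a generic user agent may fetch the site root.
        print(uri, "root allowed for *:", rules.can_fetch("*", "/"))
```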
non200responses.paths.gz
This file lists the WARC files containing captures for which the crawler received a non-200 HTTP status code (e.g., 404 Not Found, 301 Moved Permanently). These records are useful for identifying broken links and redirects.[1, 2, 3, 4, 5, 27, 28]
You can process these files to analyze website availability and identify issues encountered during the crawl. By downloading the corresponding WARC records, you can investigate the reasons behind the non-200 status codes.[27]
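For instance, a sketch with the same requests/warcio assumptions and a local non200responses.paths.gz that tallies the HTTP status codes found in one of these WARC files:

```python
import gzip
from collections import Counter

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"

with gzip.open("non200responses.paths.gz", "rt") as f:
    warc_path = f.readline().strip()

# Count how often each HTTP status code appears among the response records.
status_counts = Counter()
with requests.get(BASE + warc_path, stream=True) as resp:
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response" and record.http_headers is not None:
            status_counts[record.http_headers.get_statuscode()] += 1

print(status_counts.most_common(10))  # most frequent status codes in this file
```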
cc-index.paths.gz
This file lists the paths to the Common Crawl URL index files. The URL index records, for every crawled URL, the WARC file, byte offset, and record length where its capture is stored.[29, 1, 2, 3, 4, 5]
This index is crucial for targeted data retrieval. You can query it to find the specific WARC file and byte offset for a particular URL. Tools like cdx-toolkit[30] and the Common Crawl Index Server API[12, 31, 32] can be used to interact with this index, and you can also process the index files directly in Python to extract specific information.[33, 34]
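As an illustrative sketch, assuming the crawl label CC-MAIN-2024-10 (pick any index listed at https://index.commoncrawl.org/collinfo.json) and the requests package, you can query the index server and then fetch a single capture with an HTTP Range request:

```python
import json

import requests

# Index server endpoint for one crawl (label is an assumption; see collinfo.json).
API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

params = {"url": "commoncrawl.org/*", "output": "json", "limit": 5}
resp = requests.get(API, params=params)
resp.raise_for_status()

# One JSON record per line: URL plus the WARC filename, offset, and length.
records = [json.loads(line) for line in resp.text.splitlines()]
for rec in records:
    print(rec["url"], rec["filename"], rec["offset"], rec["length"])

# Fetch just one capture from its WARC file with an HTTP Range request;
# the returned bytes are a single gzipped WARC record.
rec = records[0]
start = int(rec["offset"])
end = start + int(rec["length"]) - 1
warc_url = "https://data.commoncrawl.org/" + rec["filename"]
raw = requests.get(warc_url, headers={"Range": f"bytes={start}-{end}"}).content
print(len(raw), "bytes fetched")
```

The fetched bytes can then be decompressed with gzip or parsed with warcio, exactly as for a full WARC file.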
cc-index-table.paths.gz
This file lists the paths to the columnar URL index files, which are stored in Apache Parquet format. This index provides a more efficient way to query and analyze the crawled data using SQL-like queries.[7, 35, 1, 2, 3, 4, 5]
The columnar index can be queried with tools like AWS Athena[35, 36] or Apache Spark[35, 36], or with Python libraries such as pyarrow and fastparquet.[37] This allows efficient filtering and analysis of the crawl data based on metadata fields such as URL, content type, and language.[33, 13, 30, 38]
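A sketch assuming pyarrow is installed, that anonymous S3 access to the public commoncrawl bucket is permitted from your environment, and that the column names used below match the published cc-index table schema:

```python
import gzip

import pyarrow.dataset as ds
import pyarrow.fs as fs

# One Parquet file path taken from a locally downloaded cc-index-table.paths.gz.
with gzip.open("cc-index-table.paths.gz", "rt") as f:
    parquet_path = f.readline().strip()

# Anonymous read from the public commoncrawl bucket in us-east-1.
s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")
dataset = ds.dataset("commoncrawl/" + parquet_path, filesystem=s3, format="parquet")

# Project a few columns and keep only captures tagged as English.
table = dataset.to_table(
    columns=["url", "content_mime_type", "warc_filename",
             "warc_record_offset", "warc_record_length"],
    filter=ds.field("content_languages") == "eng",
)
print(table.num_rows, "matching rows")
```

If S3 access is not an option, the same Parquet file can be downloaded over HTTPS from https://data.commoncrawl.org/ and read locally with pyarrow.parquet.read_table.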