Different annual statistics of the Common Crawl robots.txt web archives
During the study, we extracted the URLs of all valid robots.txt files from the December 2023 crawl dump, together with the Sitemap links listed in those files. The links are also aggregated by website category using the Curlie top-level labels. Both datasets are available for download from this Zenodo repository.
Dataset | Entries | File |
---|---|---|
**Robots.txt URLs** | | |
List of all valid robots.txt URLs | 41,611,704 | Download |
Curlie-curated list of valid robots.txt URLs | 314,228 | Download |
**Sitemap links** | | |
List of all sitemap links | 32,252,027 | Download |
Curlie-curated list of sitemap links | 254,298 | Download |
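
The Sitemap links counted above come from `Sitemap:` lines inside robots.txt files. The following is a minimal sketch of how such links could be collected from the text of a single robots.txt file. It is not the extraction code used in the study; the function name `extract_sitemap_links` and the sample content are hypothetical, and the sketch assumes that `#` starts a comment and that the field name is matched case-insensitively.

```python
def extract_sitemap_links(robots_txt: str) -> list[str]:
    """Return the distinct Sitemap URLs declared in a robots.txt body.

    Illustrative helper (hypothetical name), not the study's pipeline.
    """
    links: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Assumption: everything after '#' is a comment. This would also
        # cut a URL fragment, which is acceptable for a sketch.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # Split on the first colon only, because the URL itself contains ':'.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in links:
                links.append(url)
    return links


# Hypothetical robots.txt content used only for illustration.
sample = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news.xml  # lower-case field name
Sitemap: https://example.com/sitemap.xml
"""

for link in extract_sitemap_links(sample):
    print(link)
```

Duplicate declarations are collapsed here so that each link is reported once per file. Whether the published counts were deduplicated in the same way is not stated in the source.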