Longitudinal study of robots.txt files

Annual statistics from the Common Crawl robots.txt web archives

View the Software Project on GitLab

Resources

In the course of the study, we extracted the URLs of the valid robots.txt files from the December 2023 crawl dump, along with the Sitemap links contained in those robots.txt files. The links are also aggregated by website category using the Curlie top-level labels. Both datasets are available for download from this Zenodo repository.

Robots.txt URLs
List of all valid robots.txt URLs: 41,611,704 (Download)
Curlie-curated list of valid robots.txt URLs: 314,228 (Download)

Sitemap links
List of all sitemap links: 32,252,027 (Download)
Curlie-curated list of sitemap links: 254,298 (Download)
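
To illustrate what extracting Sitemap links from a robots.txt file involves, here is a minimal Python sketch. It is not the extraction code used in the study; the function name `extract_sitemap_links` and the example URLs are hypothetical. It assumes the common convention that a sitemap is declared on its own line as `Sitemap: <url>`, with the field name matched case-insensitively and `#` starting a comment.

```python
def extract_sitemap_links(robots_txt: str) -> list[str]:
    """Return the URLs declared in 'Sitemap:' lines of a robots.txt body.

    Illustrative sketch only (hypothetical helper, not the study's pipeline).
    """
    links = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments; this also cuts URL fragments after '#',
        # which is an accepted simplification for this sketch.
        line = raw_line.split("#", 1)[0].strip()
        # Split only on the first colon so the URL's own "://" stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            url = value.strip()
            if url:
                links.append(url)
    return links


# Hypothetical robots.txt content for demonstration.
example = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml  # lower-case field name
"""

print(extract_sitemap_links(example))
# ['https://example.com/sitemap.xml', 'https://example.com/news-sitemap.xml']
```

Python's standard library offers a comparable facility through `urllib.robotparser.RobotFileParser.site_maps()`; the hand-written loop is used here only to make the parsing steps visible.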