Longitudinal study of robots.txt files

Annual statistics derived from the Common Crawl robots.txt web archives

View the Software Project on GitLab

Overview

Since 2016, Common Crawl has regularly published the robots.txt files fetched during the CCBot's web crawls. The robots.txt dumps are released alongside the regular WARC, WAT, and WET files at intervals of roughly two to three months. We have parsed the last robots.txt dump of each year since 2016, yielding eight years of collected statistics.
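As an illustration of how such a dump can be processed, the sketch below iterates over one robots.txt WARC file with the warcio library and counts the responses that the Python standard library's robots.txt parser accepts. This is only a minimal example under those assumptions, not the project's actual pipeline, which may apply stricter validity checks; the file name is a placeholder.

```python
# Minimal sketch (not the project's pipeline): count robots.txt responses
# in a single Common Crawl robots.txt WARC file that the standard library
# parser accepts. Requires the warcio package; the file name is a placeholder.
from urllib.robotparser import RobotFileParser
from warcio.archiveiterator import ArchiveIterator

def count_robots(warc_path):
    total, parsed = 0, 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # the dumps also hold request/metadata records
            total += 1
            if record.http_headers.get_statuscode() != "200":
                continue  # unsuccessful fetches (e.g. 404) remain in the dump
            body = record.content_stream().read().decode("utf-8", errors="replace")
            parser = RobotFileParser()
            parser.parse(body.splitlines())  # lenient stdlib parsing
            parsed += 1
    return total, parsed

if __name__ == "__main__":
    total, parsed = count_robots("robotstxt-dump.warc.gz")
    print(f"{parsed} of {total} responses parsed as robots.txt")
```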

The following table lists, for each year, the period in which the robots.txt files were fetched (capture time). Because the dumps also contain unsuccessful fetches (e.g. HTTP status code 404) and unparsable files, the table additionally reports the number of successfully parsed robots.txt files alongside the total size of the crawl dump, and from these derives an estimate of the adoption rate of robots.txt among websites (more precisely, hosts); the calculation is sketched below the table.

Year  Capture start  Capture end  Size of crawl dump  Valid robots.txt files  Adoption among websites
2023  Nov 28         Dec 12       74,300,729          41,611,877              56.0 %
2022  Nov 26         Dec 10       63,477,936          37,379,200              58.9 %
2021  Nov 26         Dec 09       65,995,031          38,259,274              58.0 %
2020  Nov 23         Dec 06       69,379,415          43,557,331              62.8 %
2019  Dec 05         Dec 16       63,224,805          36,415,185              57.6 %
2018  Dec 09         Dec 19       91,547,206          47,957,269              52.4 %
2017  Dec 11         Dec 19       57,854,093          34,782,896              60.1 %
2016  Dec 02         Dec 07       57,746,232          33,727,243              58.4 %
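The adoption rate in the last column corresponds to the number of valid robots.txt files divided by the size of the crawl dump. A quick cross-check using the 2023 row:

```python
# Adoption rate for 2023, using the figures from the table above:
# valid robots.txt files divided by the total records in the crawl dump.
dump_size   = 74_300_729   # size of crawl dump
valid_files = 41_611_877   # thereof valid robots.txt files
print(f"{valid_files / dump_size:.1%}")   # -> 56.0%
```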