Annual statistics of the Common Crawl robots.txt web archives
Since 2016, Common Crawl has regularly published the robots.txt files fetched during the CCBot's web crawls. The robots.txt dumps are published alongside the regular WARC, WAT, and WET files at intervals of approximately two to three months. We have parsed the last robots.txt dump of each year since 2016, resulting in eight years of collected statistics.
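To illustrate how such a dump can be processed, here is a minimal sketch that counts successful fetches in a single robots.txt WARC file. It assumes the third-party `warcio` package and a locally downloaded dump; the file name is a placeholder, and the sketch only checks the HTTP status, it does not attempt to parse the robots.txt payload itself.

```python
# Minimal sketch: count successful robots.txt fetches in one WARC dump.
# Assumes `warcio` (pip install warcio); the file name is a placeholder.
from warcio.archiveiterator import ArchiveIterator

total = successful = 0
with open("robotstxt.warc.gz", "rb") as stream:            # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":                   # skip request/metadata records
            continue
        total += 1
        if record.http_headers.get_statuscode() == "200":   # successful fetch
            successful += 1
            body = record.content_stream().read()           # raw robots.txt payload (bytes)

print(f"{successful} of {total} responses returned HTTP 200")
```

Determining the number of *valid* robots.txt files, as reported in the table below, additionally requires running each payload through a robots.txt parser, which this sketch omits.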
The following table lists each year together with the period in which the robots.txt files were fetched (capture time). Because the dumps also contain unsuccessful fetches (e.g. HTTP status code 404) and unparsable files, the table also gives the total number of successfully parsed robots.txt files and an estimate of the adoption rate of robots.txt among websites (more precisely, hosts); the rate is simply the ratio of these two counts, as sketched below the table.
| Year | Capture start | Capture end | Size of crawl dump | Valid robots.txt files | Adoption among websites |
|---|---|---|---|---|---|
| 2023 | Nov 28 | Dec 12 | 74,300,729 | 41,611,877 | 56.0 % |
| 2022 | Nov 26 | Dec 10 | 63,477,936 | 37,379,200 | 58.9 % |
| 2021 | Nov 26 | Dec 09 | 65,995,031 | 38,259,274 | 58.0 % |
| 2020 | Nov 23 | Dec 06 | 69,379,415 | 43,557,331 | 62.8 % |
| 2019 | Dec 05 | Dec 16 | 63,224,805 | 36,415,185 | 57.6 % |
| 2018 | Dec 09 | Dec 19 | 91,547,206 | 47,957,269 | 52.4 % |
| 2017 | Dec 11 | Dec 19 | 57,854,093 | 34,782,896 | 60.1 % |
| 2016 | Dec 02 | Dec 07 | 57,746,232 | 33,727,243 | 58.4 % |
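As a worked check of the adoption-rate column, the last column is the share of valid robots.txt files among all records in the dump; using the 2023 row:

```python
# Worked example of the adoption-rate column, using the 2023 figures above.
total_records = 74_300_729   # size of crawl dump (all fetch attempts)
valid_files = 41_611_877     # successfully parsed robots.txt files
print(f"{valid_files / total_records:.1%}")  # -> 56.0%
```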