Longitudinal study of robots.txt files

Annual statistics derived from the Common Crawl robots.txt web archives

View the Software Project on GitLab

Overview

Since 2016, Common Crawl has regularly published the robots.txt files fetched during the CCBot's web crawls. The robots.txt dumps are released alongside the regular WARC, WAT, and WET files at intervals of roughly two to three months. We have parsed the last robots.txt dump of each year since 2016, yielding eight years of collected statistics.
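As an illustration of how such a dump can be processed, the sketch below iterates over one robots.txt WARC file with the warcio library and counts the responses that the Python standard library's robots.txt parser accepts. This is only a minimal example under those assumptions, not the project's actual pipeline, which may apply stricter validity checks; the file name is a placeholder.

```python
# Minimal sketch (not the project's pipeline): count robots.txt responses
# in a single Common Crawl robots.txt WARC file that the standard library
# parser accepts. Requires the warcio package; the file name is a placeholder.
from urllib.robotparser import RobotFileParser
from warcio.archiveiterator import ArchiveIterator

def count_robots(warc_path):
    total, parsed = 0, 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # the dumps also hold request/metadata records
            total += 1
            if record.http_headers.get_statuscode() != "200":
                continue  # unsuccessful fetches (e.g. 404) remain in the dump
            body = record.content_stream().read().decode("utf-8", errors="replace")
            parser = RobotFileParser()
            parser.parse(body.splitlines())  # lenient stdlib parsing
            parsed += 1
    return total, parsed

if __name__ == "__main__":
    total, parsed = count_robots("robotstxt-dump.warc.gz")
    print(f"{parsed} of {total} responses parsed as robots.txt")
```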

The following table lists, for each year, the period in which the robots.txt files were fetched (capture time). Because the dumps also contain unsuccessful fetches (e.g. HTTP status code 404) and unparsable files, the table additionally reports the number of successfully parsed robots.txt files alongside the total size of the crawl dump, and from these derives an estimate of the adoption rate of robots.txt among websites (more precisely, hosts); the calculation is sketched below the table.

Year  Capture start  Capture end  Size of crawl dump  Valid robots.txt files  Adoption among websites
2023  Nov 28         Dec 12       74,300,729          41,611,877              56.0 %
2022  Nov 26         Dec 10       63,477,936          37,379,200              58.9 %
2021  Nov 26         Dec 09       65,995,031          38,259,274              58.0 %
2020  Nov 23         Dec 06       69,379,415          43,557,331              62.8 %
2019  Dec 05         Dec 16       63,224,805          36,415,185              57.6 %
2018  Dec 09         Dec 19       91,547,206          47,957,269              52.4 %
2017  Dec 11         Dec 19       57,854,093          34,782,896              60.1 %
2016  Dec 02         Dec 07       57,746,232          33,727,243              58.4 %
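The adoption rate in the last column corresponds to the number of valid robots.txt files divided by the size of the crawl dump. A quick cross-check using the 2023 row:

```python
# Adoption rate for 2023, using the figures from the table above:
# valid robots.txt files divided by the total records in the crawl dump.
dump_size   = 74_300_729   # size of crawl dump
valid_files = 41_611_877   # thereof valid robots.txt files
print(f"{valid_files / dump_size:.1%}")   # -> 56.0%
```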