Longitudinal study of robots.txt files

Annual statistics from the Common Crawl robots.txt web archives

View the Software Project on GitLab

User agent bias

Previous studies of robots.txt files, such as Sun et al., discuss a potential bias towards specific user agents. The hypothesis is that site owners across the web are more averse to some crawlers than to others, so that some crawlers are generally restricted to smaller parts of websites. Our analysis shows that such a bias exists. Most notably, the bias in robots.txt files favors Googlebot, AdsBot-Google and Mediapartners-Google, reflecting Google's dominance. Conversely, agents such as CCBot, Google-Extended and GPTBot face increasing blocks, highlighting concerns about data control related to the development of generative AI tools. The following plot shows the bias among user agents in 2016 and 2023, as defined by Sun et al. A positive bias means that the agent is favored over the global user agent, whereas a negative value indicates that the crawler is restricted to less content in comparison to the global user agent.

Plot of bias among user agents

The analysis comprises 14 product tokens (AdsBot-Google, AhrefsBot, Baiduspider, Bingbot, CCBot, ChatGPT-User, Googlebot, Google-Extended, GPTBot, ia_archiver, Mediapartners-Google, MJ12bot, PetalBot, Yandex) as well as the global user agent. Note that the bias value defined by Sun et al. is a rather coarse heuristic, as it basically counts the disallow instructions per user agent. It remains unclear, however, how large a part of the website is actually covered by the URL path in each disallow statement. We therefore consider a complementary bias measure that counts the number of disallow-all instructions per user agent. This value is set in relation to the total number of appearances of the respective user agent in robots.txt files.

Example:

User-agent: GPTBot
Disallow: /

We assume a disallow-all statement if the robots.txt file contains a disallow instruction with the URL path /, which categorically excludes the corresponding web agent from all crawling.
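The complementary measure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual implementation: the function names `agents_with_disallow_all` and `disallow_all_ratio` are hypothetical, an "appearance" is assumed to mean one robots.txt file that names the user agent, and the group parsing is deliberately simplified (only `Allow` and `Disallow` lines are treated as rules).

```python
from collections import Counter


def agents_with_disallow_all(robots_txt: str) -> dict[str, bool]:
    """Map each user agent named in one file to True if it has 'Disallow: /'."""
    result: dict[str, bool] = {}
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            agent = value.lower()  # product tokens are matched case-insensitively
            current_agents.append(agent)
            result.setdefault(agent, False)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in current_agents:
                    result[agent] = True
    return result


def disallow_all_ratio(files: list[str]) -> dict[str, float]:
    """Ratio of files with a disallow-all rule to files naming the agent."""
    appearances: Counter[str] = Counter()
    disallow_all: Counter[str] = Counter()
    for text in files:
        for agent, blocked in agents_with_disallow_all(text).items():
            appearances[agent] += 1
            disallow_all[agent] += blocked  # True counts as 1
    return {agent: disallow_all[agent] / appearances[agent] for agent in appearances}


# Made-up sample files, for illustration only (not data from the study).
sample_files = [
    "User-agent: GPTBot\nDisallow: /\n",
    "User-agent: GPTBot\nDisallow: /private\n\nUser-agent: *\nDisallow: /\n",
    "User-agent: *\nDisallow:\n",
]
ratios = disallow_all_ratio(sample_files)
print(ratios)
```

Following the assumption stated above, a file counts as disallow-all for an agent as soon as it contains `Disallow: /` for that agent, even if additional `Allow` lines would re-open parts of the site.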


User agent    Ratio of disallow-all instructions to the total number of appearances
* 1.6 %
Adsbot-Google 1.27 %
AhrefsBot 66.94 %
Baiduspider 61.86 %
Bingbot 14.28 %
CCBot 20.43 %
ChatGPT-User -
Googlebot 2.86 %
Google-Extended -
GPTBot -
ia_archiver 32.96 %
Mediapartners-Google 0.39 %
MJ12bot 72.01 %
PetalBot -
Yandex 15.86 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 1.39 %
Adsbot-Google 0.92 %
AhrefsBot 69.1 %
Baiduspider 29 %
Bingbot 53.51 %
CCBot 32.51 %
ChatGPT-User -
Googlebot 33.04 %
Google-Extended -
GPTBot -
ia_archiver 64.06 %
Mediapartners-Google 0.3 %
MJ12bot 80.73 %
PetalBot -
Yandex 17.94 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 2.99 %
Adsbot-Google 0.31 %
AhrefsBot 54.31 %
Baiduspider 57.53 %
Bingbot 22.99 %
CCBot 45.77 %
ChatGPT-User -
Googlebot 10.98 %
Google-Extended -
GPTBot -
ia_archiver 41.46 %
Mediapartners-Google 0.28 %
MJ12bot 64.7 %
PetalBot -
Yandex 28.72 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 2.89 %
Adsbot-Google 0.75 %
AhrefsBot 69.38 %
Baiduspider 33.81 %
Bingbot 33.13 %
CCBot 58.5 %
ChatGPT-User -
Googlebot 36.56 %
Google-Extended -
GPTBot -
ia_archiver 71.89 %
Mediapartners-Google 1.04 %
MJ12bot 69.37 %
PetalBot -
Yandex 29.96 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 2.53 %
Adsbot-Google 0.32 %
AhrefsBot 57.96 %
Baiduspider 52.22 %
Bingbot 19.96 %
CCBot 55.51 %
ChatGPT-User -
Googlebot 13.82 %
Google-Extended -
GPTBot -
ia_archiver 52.17 %
Mediapartners-Google 0.9 %
MJ12bot 62 %
PetalBot 95.05 %
Yandex 29.68 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 3.88 %
Adsbot-Google 0.32 %
AhrefsBot 51.48 %
Baiduspider 49.43 %
Bingbot 14.48 %
CCBot 60.9 %
ChatGPT-User -
Googlebot 8.84 %
Google-Extended -
GPTBot -
ia_archiver 43.3 %
Mediapartners-Google 1.11 %
MJ12bot 60.35 %
PetalBot 89.49 %
Yandex 28.01 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 3.11 %
Adsbot-Google 0.16 %
AhrefsBot 48.83 %
Baiduspider 31.96 %
Bingbot 7.19 %
CCBot 61.81 %
ChatGPT-User -
Googlebot 4.45 %
Google-Extended -
GPTBot -
ia_archiver 21.37 %
Mediapartners-Google 0.64 %
MJ12bot 54.49 %
PetalBot 99.12 %
Yandex 27.92 %

User agent    Ratio of disallow-all instructions to the total number of appearances
* 3.12 %
Adsbot-Google 0.19 %
AhrefsBot 18.71 %
Baiduspider 37.46 %
Bingbot 30.19 %
CCBot 98.97 %
ChatGPT-User 95.48 %
Googlebot 1.83 %
Google-Extended 99.92 %
GPTBot 99.61 %
ia_archiver 19.73 %
Mediapartners-Google 0.46 %
MJ12bot 50.98 %
PetalBot 99.08 %
Yandex 25.23 %