Longitudinal study of robots.txt files

Annual statistics from the Common Crawl robots.txt web archives

View the Software Project on GitLab

Top user agents

Crawlers identify themselves by a name, also called a product token, which they use to find the relevant groups of allow and disallow instructions in a robots.txt file. The product token (e.g. ExampleBot) is a substring of the crawler's User-Agent HTTP header. The following tables list the most frequently used product tokens in robots.txt files over the years. Note that for all further analysis of user agents we employ a case-insensitive comparison (Bingbot and bingbot are treated as identical). The last column gives the percentage of robots.txt files that contain the respective product token.
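The frequency column can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the project's actual pipeline: the function names product_tokens and token_frequencies and the sample files are hypothetical, and the sketch assumes that a product token is whatever follows a User-agent field name, compared case-insensitively and counted at most once per file.

```python
from collections import Counter


def product_tokens(robots_txt: str) -> set[str]:
    """Return the lower-cased product tokens named in one robots.txt file."""
    tokens = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]              # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            value = value.strip()
            if value:
                tokens.add(value.lower())         # case-insensitive comparison
    return tokens


def token_frequencies(files: list[str]) -> dict[str, float]:
    """Percentage of files that contain each product token at least once."""
    counts = Counter()
    for text in files:
        counts.update(product_tokens(text))       # a set, so each file counts once
    return {token: 100.0 * n / len(files) for token, n in counts.items()}


# Hypothetical sample: four tiny robots.txt files.
sample = [
    "User-agent: *\nDisallow: /private/\n",
    "User-agent: Bingbot\nDisallow: /\n\nUser-agent: *\nAllow: /\n",
    "user-agent: bingbot\nDisallow: /tmp/\n",
    "# no groups at all\n",
]
freq = token_frequencies(sample)
for token, share in sorted(freq.items(), key=lambda item: -item[1]):
    print(f"{token} {share:.2f} %")
```

Because every file contributes a set of tokens, a file that names the same crawler in several groups is still counted once, which matches the reading of the last column as a share of files rather than a share of groups.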


# User agent Frequency
1 * 97.8 %
2 Mediapartners-Google 15.5 %
3 Yandex 5.91 %
4 Baiduspider 5.3 %
5 Googlebot 5.15 %
6 msnbot 4.21 %
7 Irlbot 3.99 %
8 MJ12bot 3.7 %
9 ia_archiver 3.49 %
10 Slurp 3.21 %
11 SemrushBot 2.91 %
12 Googlebot-Image 2.83 %
13 Ips-Agent 2.42 %
14 Bingbot 2.21 %
15 AhrefsBot 2.09 %
16 Webcopier 1.81 %
17 Psbot 1.79 %
18 Webzip 1.78 %
19 Webstripper 1.77 %
20 Sitesnagger 1.76 %
# User agent Frequency
1 * 96.47 %
2 Mediapartners-Google 19.69 %
3 AhrefsBot 7.4 %
4 MJ12bot 6.83 %
5 Irlbot 5.4 %
6 Yandex 5.18 %
7 Googlebot 4.82 %
8 SemrushBot 4.57 %
9 msnbot 3.92 %
10 Bingbot 3.84 %
11 Baiduspider 3.55 %
12 Semrushbot-Sa 3.29 %
13 Slurp 2.93 %
14 Dotbot 2.87 %
15 ia_archiver 2.85 %
16 Twiceler 2.54 %
17 Gigabot 2.34 %
18 Teoma 2.23 %
19 YandexBot 2.2 %
20 Megaindex.Ru 2.17 %
# User agent Frequency
1 * 98.15 %
2 Mediapartners-Google 8.64 %
3 MJ12bot 3.95 %
4 Adsbot-Google 3.87 %
5 AhrefsBot 3.6 %
6 Googlebot 3.59 %
7 Yandex 3.54 %
8 ia_archiver 2.57 %
9 AdsBot-Google-Mobile 2.56 %
10 Adsbot-Google-Mobile-Apps 2.55 %
11 Bingbot 2.48 %
12 Baiduspider 2.46 %
13 Dotbot 1.89 %
14 SemrushBot 1.79 %
15 msnbot 1.74 %
16 Psbot 1.67 %
17 Nutch 1.49 %
18 Sosospider 1.38 %
19 Slurp 1.34 %
20 Webcopier 1.3 %
# User agent Frequency
1 * 98.07 %
2 Googlebot 5.81 %
3 MJ12bot 4.72 %
4 AhrefsBot 4.12 %
5 Baiduspider 3.89 %
6 Mediapartners-Google 3.88 %
7 Bingbot 3.22 %
8 ia_archiver 3.21 %
9 Yandex 2.83 %
10 msnbot 2.73 %
11 Slurp 2.5 %
12 Psbot 2.4 %
13 SemrushBot 2.24 %
14 Sogou Spider 2.19 %
15 Gigabot 2.18 %
16 Googlebot-Image 2.07 %
17 Dotbot 2.01 %
18 Sosospider 1.91 %
19 Googlebot-Mobile 1.86 %
20 Teoma 1.71 %
# User agent Frequency
1 * 98.32 %
2 Adsbot-Google 6.34 %
3 AdsBot-Google-Mobile 4.63 %
4 MJ12bot 4.06 %
5 AhrefsBot 3.39 %
6 Googlebot 3.38 %
7 Mediapartners-Google 2.86 %
8 Bingbot 2.22 %
9 Baiduspider 2.19 %
10 Yandex 2.18 %
11 SemrushBot 2.1 %
12 ia_archiver 1.72 %
13 Nutch 1.67 %
14 Dotbot 1.58 %
15 Blexbot 1.41 %
16 Pinterest 1.35 %
17 Nerdybot 1.34 %
18 Slurp 1.28 %
19 AhrefsSiteAudit 1.22 %
20 msnbot 1.18 %
# User agent Frequency
1 * 98.28 %
2 Adsbot-Google 8.04 %
3 AdsBot-Google-Mobile 5.71 %
4 MJ12bot 5.53 %
5 Googlebot 4.5 %
6 AhrefsBot 4.4 %
7 Mediapartners-Google 3.59 %
8 Bingbot 3.28 %
9 Baiduspider 3.03 %
10 Yandex 2.53 %
11 SemrushBot 2.51 %
12 Nutch 2.3 %
13 Dotbot 2.26 %
14 ia_archiver 2.14 %
15 Pinterest 1.98 %
16 AhrefsSiteAudit 1.83 %
17 Blexbot 1.83 %
18 Slurp 1.63 %
19 msnbot 1.52 %
20 Googlebot-Image 1.3 %
# User agent Frequency
1 * 96.33 %
2 Adsbot-Google 9.12 %
3 PetalBot 7.45 %
4 AdsBot-Google-Mobile 6.81 %
5 Googlebot 6.19 %
6 Mediapartners-Google 5.12 %
7 MJ12bot 4.9 %
8 Bingbot 4.89 %
9 Baiduspider 4.5 %
10 AhrefsBot 4.33 %
11 ia_archiver 3.57 %
12 Yandex 2.55 %
13 YandexBot 2.54 %
14 SemrushBot 2.53 %
15 Sosospider 2.5 %
16 Yahoo! Slurp 2.46 %
17 Seznambot 2.36 %
18 Sogou Spider 2.35 %
19 Nutch 2.26 %
20 Youdaobot 2.21 %
# User agent Frequency
1 * 96.38 %
2 Googlebot 12.85 %
3 AhrefsBot 11.39 %
4 Adsbot-Google 9.54 %
5 PetalBot 7.82 %
6 AdsBot-Google-Mobile 7.08 %
7 Bingbot 5.92 %
8 MJ12bot 4.81 %
9 Mediapartners-Google 4.66 %
10 Baiduspider 4.01 %
11 ia_archiver 3.52 %
12 SemrushBot 2.65 %
13 Yandex 2.61 %
14 Yahoo! Slurp 2.53 %
15 YandexBot 2.49 %
16 Dotbot 2.43 %
17 Nutch 2.36 %
18 Sosospider 2.25 %
19 Seznambot 2.24 %
20 Sogou Spider 2.2 %