How much of your web traffic is bots?

I’ve been having a war on bots. Many follow the robots.txt rules, but many don’t. Some are just relentless and won’t stop even if you ask them politely in robots.txt. Many come at you from a huge range of IP addresses in parallel, so they eat into the service you are providing for your actual visitors.

I am shocked that, over the last 12 days, I’ve reduced my traffic by 85%. It’s dramatically reduced the load on the machine too. Now, my web sites are really only hobby ones, so I don’t expect loads of visitors, and your reduction may not be so dramatic.

Many of these bots are trawling for AI purposes, and many for private companies offering SEO services. Who you block is up to you; you may want them to scan your pages. I haven’t killed them all, as I obviously want the major search engines to keep coming back.

What to put in robots.txt

You’ll all probably know this, but I’ve put it here for completeness. Let’s say you want to block ChatGPT’s bot; you add this to robots.txt:

User-agent: GPTBot
Disallow: /

You’ll find that GPTBot will stop, eventually.
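
If you want to turn away several bots, each one gets its own entry. As a sketch, using a few of the names I mention later (pick your own list):

User-agent: SemrushBot
Disallow: /

User-agent: MJ12Bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

Anything you don’t name is still allowed in by default, so the major search engines carry on as before.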

What about those that are relentless or ignore robots.txt?

First, wait for a day or so, so that your new robots.txt gets pulled.

Then you can get Apache to block them with a 403 error. Create /srv/example.com/config/apache.d, put the following into crawler.conf in that directory, and then restart Apache.

<Directory "/srv/example.com/public/htdocs">
RewriteEngine on
RewriteCond ^.*robots.txt [L]
RewriteCond %{HTTP_USER_AGENT} (ahrefsbot|amazonbot|aspiegelbot|barkrowler|bytespider|claudebot|dotbot|facebookexternalhit|friendlycrawler|gptbot|mauibot|mj12bot|monsidobot|nicecrawler|petalbot|semrushbot|zoominfobot|owler)  [NC]
RewriteRule .* - [F,L]
</Directory>

The list here is the one I am using today; the names should match those in robots.txt. The NC flag makes the match case-insensitive. You should look for yourself at who is coming in and decide who you want to block. Note that facebookexternalhit is relentless and doesn’t appear to fetch robots.txt at all.

I found that bots will not stop if they’re just given a 403 error and get no data; you do need a robots.txt entry, and you need to let them pull that file.

Anyway, now you are delivering a robots.txt file that they should obey, and the 403 is easily detectable in the logs, so you can find the non-friendly bots. I’m blocking some of their IPs in my firewall.
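
Something like this finds the addresses collecting 403s, assuming the usual combined log format where the client IP is the first field and the status code is the ninth:

cd /srv
find . -wholename '*/public/logs/ssl_access.log' -exec cat {} \; | awk '$9 == 403 {print $1}' | sort | uniq -c | sort -nr

That gives a count of 403s per client address, which makes a reasonable shortlist for the firewall.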

Thanks for the post, I found this useful, particularly as you’ve provided a working example for completeness. In the interests of reducing load on the machine, which bots are worth blocking?

Your best bet is to look for robots.txt requests in your log files, bang the bot’s name into Google (other search engines are available; I’ve switched to DuckDuckGo) and see if you want to stop it. I search for ‘is XXXX bad’ and that gets results.

These pages: Web crawlers - LinuxReviews and Bad and Good Crawling Bots List — Simtech Development are very helpful. There are other pages.

Something like

cd /srv
find . -wholename '*/public/logs/ssl_access.log' -exec cat {} \; | grep robots.txt | grep compatible | sed -e 's/^.*compatible; //' -e 's|/.*$||' | sort | uniq -c | sort -nr

There may be better ways - this doesn’t find all bots, BTW; some don’t include ‘compatible’ in their user-agent string.

Most of the bots have now gone away. The ones that haven’t are:

  • amazonbot - this one is persistent and comes at you from several IP addresses in parallel; those IPs are now in my firewall (there’s a sketch of the rules after this list). I made sure it had my banning robots.txt file for several days before using the firewall. It’s still trying several times a day after 31 days, albeit with less alacrity. To be fair, it seems to arrive when the sun is down.

  • Barkrowler - again another persistent one, coming from several IP addresses in France. Seems to support SEO efforts. It’s been in my firewall for the last five days after having about 25 days to get the robots.txt and obey it. It came initially from 8 IP addresses in parallel, now from 5.
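
For what it’s worth, the firewall rules themselves are nothing clever. A sketch with placeholder addresses (192.0.2.44 and 198.51.100.0/24 are documentation ranges; substitute the addresses you actually see in your logs):

iptables -I INPUT -s 192.0.2.44 -j DROP
iptables -I INPUT -s 198.51.100.0/24 -j DROP

If you use nftables or a front end such as ufw, the exact commands differ, but the idea is the same.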

Others I block in robots.txt and have otherwise left alone (some are still fetching robots.txt):

  • SemrushBot - some sort of firm that sells links and graphs.

  • MJ12Bot - it’s again something to do with SEO.

  • AhrefsBot - commercial link index.

  • facebookexternalhit - you need to look this up to see if you want to block it. It doesn’t read robots.txt files and is very very demanding.

  • Claudebot - is very aggressive and is supporting AI. They are using my information for free, and then charging. This doesn’t seem like an equal relationship.

  • PetalBot (AspiegelBot) - this is a Chinese search engine, I think, but it’s too aggressive for me.

  • DotBot - is a web crawler used by Moz.com. The data collected through DotBot is surfaced on the Moz site, in Moz tools, and is also available via the Mozscape API. However, it’s proved to be a bandwidth gobbler.

  • GPTBot - guess what. It’s an AI bot, and they charge.

  • Owler - some sort of sales intelligence tool. Since I am no longer any type of business, this is not relevant.

Others from the list in the Apache config file that I don’t have in my robots.txt file:

  • Bytespider - doesn’t obey robots.txt
  • FriendlyCrawler - isn’t
  • MauiBot - comes from AWS and is aggressive
  • Monsidobot - seems to work on request from companies
  • NiteCrawler - looks for hotel information; since I am not a hotel, it’s not relevant to me
  • ZoomInfo - sells info on companies

As I say, please don’t take this as a list of sites that you must block. It should be your decision, based on your personal view of the information that they use and whether they are polite in collecting it.

There must be a better way…

cd /srv
find . \( -name 'ssl_access.log' -o -name 'ssl_access.log.1' \) -exec cat {} \; | grep robots.txt | grep compatible | sed -e 's/^.*compatible; //' -e 's|/.*$||' | sort | uniq -c | sort -nr

This does today and yesterday.
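
One way round the ‘compatible’ limitation is to print the whole user-agent string for anything fetching robots.txt, assuming the standard combined log format where the user agent is the sixth double-quote-delimited field:

cd /srv
find . \( -name 'ssl_access.log' -o -name 'ssl_access.log.1' \) -exec cat {} \; | grep robots.txt | awk -F'"' '{print $6}' | sort | uniq -c | sort -nr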