I’ve been waging a war on bots. Many follow the rules in robots.txt, but many don’t, and some are relentless and won’t stop even if you ask them politely in robots.txt. Many come at you from a huge range of IP addresses in parallel, so they eat into the service you are providing for your actual viewers.
I’m shocked that, over the last 12 days, I’ve cut my traffic by 85%. It has dramatically reduced the load on the machine too. My web sites are really only hobby ones and I don’t expect loads of visitors, so your reduction may not be so dramatic.
Many of these bots are trawling for AI purposes; others work for private companies offering SEO services. Who you block is up to you, and you may want some of them to scan your pages. I haven’t killed them all: I obviously want the major search engines to keep coming back.
What to put in robots.txt
You’ll probably all know this, but I’ve put it here for completeness. Let’s say you want to block ChatGPT’s bot; you add this to robots.txt:
User-agent: GPTBot
Disallow: /
You’ll find that GPTBot will stop, eventually.
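If you want to refuse several crawlers at once, each can get its own group, or you can stack several User-agent lines above a shared Disallow. As a sketch, using a few of the names from the blocklist further down (swap in whichever bots you actually want to refuse):
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Bytespider
Disallow: /
Whatever names you use here should match the ones you block in Apache later, so the bots can still see that they have been disallowed.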
What about those that are relentless or ignore robots.txt?
First, wait for a day or so, so that your new robots.txt gets pulled.
Then you can get Apache to block them with a 403 error. Create /srv/example.com/config/apache.d, put the following into crawler.conf in that directory, and then restart Apache.
<Directory "/srv/example.com/public/htdocs">
RewriteEngine on
RewriteCond ^.*robots.txt [L]
RewriteCond %{HTTP_USER_AGENT} (ahrefsbot|amazonbot|aspiegelbot|barkrowler|bytespider|claudebot|dotbot|facebookexternalhit|friendlycrawler|gptbot|mauibot|mj12bot|monsidobot|nicecrawler|petalbot|semrushbot|zoominfobot|owler) [NC]
RewriteRule .* - [F,L]
</Directory>
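This relies on mod_rewrite being enabled, and on your main Apache configuration actually reading files from that directory. If your setup doesn’t already do that, a sketch of the sort of line you’d add to the virtual host for example.com (the path here is just my layout, adjust to yours):
# In the VirtualHost for example.com: pull in any drop-in config files
IncludeOptional /srv/example.com/config/apache.d/*.conf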
The list here is the one I am using today; the names should match the ones in robots.txt. The NC flag makes the match case-insensitive. You should look for yourself at who is coming in and decide who you want to block. Note that facebookexternalhit is relentless and doesn’t even appear to try to fetch robots.txt.
I found that bots will not stop if they are only ever given a 403 error and get no data back; you do need the robots.txt entries as well, and you must allow them to pull that file.
Anyway, you are now delivering a robots.txt file that they should obey, and the 403s are easy to spot in the logs, so you can pick out the unfriendly bots. I’m blocking some of their IPs in my firewall.
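How you block them there depends on your firewall. As a rough sketch, assuming a machine running nftables with a simple inet filter table (the table and chain layout, and the address range, are made up for illustration), a rule like this drops a bot’s address range outright:
# /etc/nftables.conf fragment; table/chain names and the address range are assumptions
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        # Drop everything from a hypothetical bot network
        ip saddr 203.0.113.0/24 drop
    }
}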