
MY NAME IS
Ognjen Regoje
BUT YOU CAN CALL ME OGGY



Ways of coping with huge spikes in spider/crawler traffic

With Supplybunny’s growth in suppliers and products there’s been a significant increase in traffic due to spiders crawling the site.

Human traffic is very smooth and predictable, while spider traffic has extreme spikes. Optimization has to be done first, but no amount of it can help when a single client requests the equivalent of 30% of an entire day’s traffic within one hour. An average hour carries about 1/24 (roughly 4%) of the day’s traffic, so that is about a 7× spike. The extreme case is some Chinese spiders that have been responsible for more requests in 24 hours than all other agents in a month, and have sustained that load for a few days.

To cope with this issue, and not to blindly throw hardware at the problem, here are some things I did.

Webmaster consoles

In the Google webmaster console you can set the rate at which you want Google to crawl, but the setting is only valid for 90 days and then has to be renewed.

Bing is better: you can also set presets per time of day.

robots.txt

Crawl-Delay

You can set the Crawl-Delay directive although it is non-standard and only some bots follow it.

User-Agent: Bingbot
Allow: /
Crawl-delay: 10

Blocking spiders by default

User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /

Note that this will lower the Lighthouse SEO audit score, because Lighthouse looks only at the User-Agent: * group and assumes that the site is blocked to robots.

Now, these measures are useful for the ethical bots, but the most problematic ones won’t obey them anyway.

nginx

Nginx is the nuclear solution. It has the benefit of handling traffic before it reaches the application, so it is much faster.

Rate limit by user agent

First you can set request limits by UserAgent. The benefit of this is that there are plenty of UserAgent lists covering nearly all bots and you can group them all based on that, and then create a separate group for nice bots (Google, Bing). Then you can give all bots one rate and nice bots a higher one.

It takes a bit of monitoring to get the limits right, but it ends up working well.
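One way to do that monitoring without hurting anyone while the numbers are still guesses is nginx's dry-run mode. This is only a sketch: the zone name bots_trial and the rate are made up for illustration, and limit_req_dry_run needs nginx 1.17.1 or newer.

```nginx
# Goes in the http {} context. "bots_trial" is a hypothetical zone name
# and 30r/m is just a starting guess to be tuned.
limit_req_zone $binary_remote_addr zone=bots_trial:10m rate=30r/m;

server {
    listen 80;

    location / {
        limit_req zone=bots_trial burst=5 nodelay;

        # Count and log requests that would exceed the limit,
        # but do not actually reject them.
        limit_req_dry_run on;

        # Log would-be rejections at "warn" instead of the default "error".
        limit_req_log_level warn;

        # Once dry-run is switched off, answer with 429 Too Many Requests
        # instead of the default 503.
        limit_req_status 429;
    }
}
```

After watching the error log for a while, the rate and burst can be adjusted and limit_req_dry_run turned off to start enforcing.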

Here’s a config that limits Googlebot to 30 requests per minute per IP address and all clients to 30 requests per second per IP address.

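# Flag requests whose User-Agent matches Googlebot (case-insensitive).
map $http_user_agent $isbot_ua {
        default 0;
        ~*(GoogleBot) 1;
}
# Use the client IP as the limiting key for flagged bots only.
# An empty key means the request is not counted against the zone.
map $isbot_ua $limit_bot {
        0       "";
        1       $binary_remote_addr;
}

# Bots: 30 requests per minute per IP. Everyone: 30 requests per second per IP.
limit_req_zone $limit_bot zone=bots:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=one:10m rate=30r/s;

# Both limits are checked for every request.
limit_req zone=bots burst=5 nodelay;
limit_req zone=one burst=30 nodelay;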
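The same pattern extends to the two groups described earlier, nice bots and everyone else that looks like a bot. What follows is a rough sketch rather than the config I ran: the user agent patterns are placeholders (a real list would come from one of the published UserAgent lists), and the zone names and rates are invented for illustration.

```nginx
# http {} context. Regexes are checked in order of appearance, so the
# "nice" pattern has to come before the catch-all bot pattern.
map $http_user_agent $bot_class {
    default                 "";
    ~*(googlebot|bingbot)   nice;    # placeholder list of well-behaved bots
    ~*(bot|crawl|spider)    other;   # placeholder catch-all for other bots
}

# One key per group; an empty key means "do not count this request".
map $bot_class $limit_nice_bot {
    default "";
    nice    $binary_remote_addr;
}
map $bot_class $limit_other_bot {
    default "";
    other   $binary_remote_addr;
}

# Illustrative rates: nice bots get a higher allowance than the rest.
limit_req_zone $limit_nice_bot  zone=nice_bots:10m  rate=60r/m;
limit_req_zone $limit_other_bot zone=other_bots:10m rate=10r/m;

server {
    listen 80;

    location / {
        limit_req zone=nice_bots  burst=10 nodelay;
        limit_req zone=other_bots burst=5  nodelay;
    }
}
```

Human traffic matches neither pattern, gets an empty key in both maps, and is therefore not counted against either zone.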

More details on how to set this up are here:

  • https://www.nginx.com/blog/rate-limiting-nginx/
  • http://nginx.org/en/docs/http/ngx_http_limit_req_module.html
  • https://www.freecodecamp.org/news/nginx-rate-limiting-in-a-nutshell-128fe9e0126c/

GeoIP

The nuclear option within the nuclear option. The Chinese (and, to a lesser extent, Russian) bots just wouldn’t follow any rules. They don’t obey robots.txt. They don’t have consistent user agents. They have many IPs.

Recompiling nginx with the GeoIP module and basically dropping all traffic from China and Russia was the only remaining viable solution.

For Supplybunny it’s completely workable since we’re very local. We don’t have legitimate traffic from either country. Even if we did, we wouldn’t be able to serve it.

Compiling nginx with GeoIP is relatively well documented. Compiling it with rvm (and passenger) needed a few more steps.

First, the correct gemset has to be set up and selected with rvm use, and passenger installed.

Nginx has to be compiled from source; the latest version is available from the nginx download page. The GeoIP2 module source (ngx_http_geoip2_module) has to be downloaded separately.

Then, run rvmsudo passenger-install-nginx-module to start, and choose the custom installation:

2. No: I want to customize my Nginx installation. (for advanced users)

The first step is to enter the location of nginx source which is where the source was extracted to.

The prefix path can be left the same.

At the "Extra arguments to pass to configure script" prompt, include the GeoIP2 module source like this:

--add-dynamic-module=/path/to/ngx_http_geoip2_module-master

Nginx will then compile.

Finally, here’s what needs to be added to nginx.conf to block China and Russia:

load_module modules/ngx_http_geoip2_module.so;

# This must be installed and downloaded separately.
geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
  $geoip2_data_country_code country iso_code;
}

map $geoip2_data_country_code $allowed_country {
  default yes;
  CN no;
  RU no;
}
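The map on its own only sets a variable; something still has to act on it. A minimal sketch of that last step, assuming the geoip2 and map blocks above sit in the http context (load_module goes at the very top of nginx.conf) and using a placeholder server_name:

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder, not the real host

    # $allowed_country is set by the map above.
    if ($allowed_country = no) {
        # 444 is nginx-specific: close the connection without
        # sending any response at all.
        return 444;
    }

    # ... rest of the server configuration ...
}
```

Returning 444 rather than 403 is a design choice: the blocked client gets nothing back, so even the cost of building a response is avoided.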

It was specifically the GeoIP step that saved me. The amount of garbage traffic coming in was just ridiculous.

#supplybunny #technical