
web hits analysis -- how many hits are bots?

It has been my understanding for years now that the vast majority of web site hits come from robots: search engines and automated hack attempts. My analysis so far is that 54% of kwynn.com's hits over about 26 months, ending in early February, 2019, either say they are from robots or, I have reason to believe, are from robots.

The following is a brief analysis:

my first email on the subject

I just analyzed my web logs (kwynn.com) in more depth than I ever have. Your stats are almost certainly roughly the same as mine. I'm analyzing from mid-December, 2016 until a few hours ago, so almost 26 months. I got almost 700,000 (700k) individual hits. I'm analyzing the absolute raw data--individual lines in a file recording every single hit.

I sorted the hits by their identifier. The technical term is a "user agent." You'll get a better understanding of that in a moment.
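A minimal sketch of the counting step, for anyone who wants to try it on their own logs. It is written in Python rather than the PHP of the actual program, and it is not that program. It assumes the log is in the Apache "combined" format, where the user agent is the last double-quoted field on each line. The names `extract_agent` and `count_agents` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, where the user agent is the
# last double-quoted field on the line. Agents containing escaped quotes
# are not handled by this simple pattern.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def extract_agent(line):
    """Return the user agent from one log line, or '-' if none is found."""
    match = AGENT_RE.search(line)
    if match is None or match.group(1) == "":
        return "-"
    return match.group(1)

def count_agents(lines):
    """Count hits per user agent, skipping blank lines."""
    return Counter(extract_agent(line) for line in lines if line.strip())

# Made-up sample lines using documentation IP addresses, not real hits.
sample = [
    '192.0.2.1 - - [15/Dec/2016:06:32:49 -0500] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)"',
    '192.0.2.2 - - [15/Dec/2016:06:33:10 -0500] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)"',
    '192.0.2.3 - - [15/Dec/2016:06:34:02 -0500] "GET /x.php HTTP/1.1" 404 0 "-" "-"',
]

# Most common agent first, which is the "sort by identifier" step.
for agent, hits in count_agents(sample).most_common():
    print(hits, agent)
```

On a real log, the same call would be `count_agents(open(path))` with `path` pointing at the access log.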

The 8 most common "agents," accounting for the first 35% of hits, say they are robots, as below. Each percentage is rounded, so the list itself appears to add up to 34%, but the sum of the un-rounded values is 35%:

7% spbot
5% DotBot
5% AhrefsBot
4% YandexBot/3.0
4% SemrushBot [a specific version of the robot program]
3% Jorgee -- this is a "vulnerability detector," or hacking attempts
3% Googlebot/2.1 [a specific version]
3% BLEXBot

The next one, at 3%, is a null agent or no agent given. That's almost certainly hackers. The next 4, again, say they are robots.

3% SemrushBot [another version]
2% Baiduspider
2% Googlebot -- the mobile compatibility version
2% bingbot 

The above are the most common 13 agents totaling 47% of hits.

The next one is a very generic 2% "Mozilla/5.0." My relatively informed opinion is that's hackers and (other) robots.

The next 3 say they are robots:

2% MJ12bot [a given version]
1% MJ12bot [another version]
1% MegaIndex.ru

That's the 17 most common and 54% of hits. That's all hackers and robots.

FINALLY, numbers 18 and 19 at (rounded to) 1% each are specific agent versions that I believe to be humans. After that, the data bounces between less common robots and probably legitimate human-driven browsers. I'll probably keep processing the data to be more specific.
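The sorting into "says it is a robot," "probably a robot or hacker," and "probably human" can be sketched roughly as follows. This is a Python illustration, not the logic of the actual program: the marker list and the `classify` function are assumptions built only from the agents named in this email. A user agent is whatever the client chooses to send, so the result is a guess either way.

```python
# Substrings shared by the self-declared robots listed above.
# Assumption for illustration only; this is not an exhaustive registry.
BOT_MARKERS = ("bot", "spider", "crawler", "jorgee")

# Agents treated as robots or hack attempts even though they do not say so:
# a missing agent, and the bare generic "Mozilla/5.0".
SUSPECT_AGENTS = {"", "-", "Mozilla/5.0"}

def classify(agent):
    """Return 'declared robot', 'suspected robot', or 'possibly human'."""
    stripped = agent.strip()
    if stripped in SUSPECT_AGENTS:
        return "suspected robot"
    lowered = stripped.lower()
    if any(marker in lowered for marker in BOT_MARKERS):
        return "declared robot"
    return "possibly human"

for agent in ("AhrefsBot/5.2; +http://ahrefs.com/robot/)", "-", "Mozilla/5.0"):
    print(classify(agent), "<-", agent)
```

Substring matching is crude: it would mislabel any human browser whose agent string happened to contain "bot", so the output still needs a look by eye.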

Depending on how far back your logs go, I can run my program on your data, eventually.

exact output (well, slightly edited for greater clarity and privacy)

For the 25 agents I list, the first column is the rounded percentage of hits from that agent. After the agent is the rank of the agent, 1 ... 25, from most common to least common. Then I give the cumulative percentage to that point. The individual terms are not rounded before they are added; only the total is rounded, so it's an accurate total.

In other words, the 25 most common agents account for 60% of all the hits in the data set.
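The column arithmetic can be sketched like this in Python. This is an illustration of the rounding rule described above, not the PHP program itself; `report` and the counts are made up, chosen so that the per-agent column and the running total visibly disagree.

```python
def report(counts, top=25):
    """Return rows of (rounded %, agent, rank, rounded cumulative %)."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0  # exact running hit count; rounding happens only for display
    for rank, (agent, hits) in enumerate(ranked[:top], start=1):
        running += hits
        rows.append((round(100 * hits / total), agent, rank,
                     round(100 * running / total)))
    return rows

# Made-up data: 1,000 hits in total, three big agents and 818 one-hit agents.
counts = {"agent-a": 74, "agent-b": 54, "agent-c": 54}
counts.update({f"other-{i}": 1 for i in range(818)})

for pct, agent, rank, cumulative in report(counts, top=3):
    print(f"{pct}% {agent} {rank} {cumulative}%")
# 7% agent-a 1 7%
# 5% agent-b 2 13%
# 5% agent-c 3 18%
```

The second row shows the effect: 7% + 5% looks like 12%, but the exact shares are 7.4% and 5.4%, so the running total is 12.8%, which rounds to 13%.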

"/usr/bin/php" "/home/[...]/logs/logs.php"
/home/[...]/logs/logs.php:108:

array(4) {
  'lines'  => int(672432)
  'start'  => string(31) "Thu, 15 Dec 2016 06:32:49 -0500"
  'end'    => string(31) "Sat, 02 Feb 2019 20:49:56 -0500"
  'agents' => int(7054)
}

7% spbot/5.0.3; +http://OpenLinkProfiler.org/bot ) 1 7% 
5% DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) 2 13% 
5% AhrefsBot/5.2; +http://ahrefs.com/robot/) 3 18% 
4% YandexBot/3.0; +http://yandex.com/bots) 4 22% 
4% SemrushBot/2~bl; +http://www.semrush.com/bot.html) 5 26% 
3% Jorgee 6 29% 
3% Googlebot/2.1; +http://www.google.com/bot.html) 7 32% 
3% BLEXBot/1.0; +http://webmeup-crawler.com/) 8 35% 
3% - 9 38% 
3% SemrushBot/1.2~bl; +http://www.semrush.com/bot.html) 10 41% 
2% Baiduspider/2.0; +http://www.baidu.com/search/spider.html) 11 43% 
2% (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 12 45% 
2% bingbot/2.0; +http://www.bing.com/bingbot.htm) 13 47% 
2% Mozilla/5.0 14 49% 
2% MJ12bot/v1.4.7; http://mj12bot.com/) 15 51% 
1% MJ12bot/v1.4.8; http://mj12bot.com/) 16 53% 
1% MegaIndex.ru/2.0; +http://megaindex.com/crawler) 17 54% 
1% (Linux; Android 4.4.2; 5560S Build/KVT49L) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36 18 55% 
1% SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/) 19 56% 
1% (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 20 57% 
1% Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07) 21 57% 
1% MauiBot (crawler.feedback+wc@gmail.com) 22 58% 
1% (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0 23 59% 
1% AhrefsBot/6.1; +http://ahrefs.com/robot/) 24 59% 
1% (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36 25 60% 
version 2019/02/03 5:31pm EST - rc1 Project ID: iFWhOg8Dg5GS     ... DONE
Done. [NetBeans output]

commentary on the code

Note that the source code below is a balance between getting something out the door and being descriptive. One day, maybe I'll come back and comment it more.

the code

page history
