user agents - what is the point of all this?

Every request to a web server such as kwynn.com generates a line in the web server access log. A "user agent" is a part of that line that self-identifies the browser or robot making the request, such as

Mozilla/5.0 (Linux; Android 9; SM-S367VL) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.92 Mobile Safari/537.36

In part because I fell compelled to do (many) things myself, I have spent lots of time digging around kwynn.com's logs trying to find useful information. (This is a case of data mining or data science and Big Data techniques on a small data set.) My first question is how many lines / hits / requests self-identify as bots. Today's answer is 79%, which is close enough to previous estimates. After that, I am trying to figure out how many human beings read my site, and how engaged they are.

I could discuss this endlessly, but I'll move this along for now.

a whole line

A whole line looks like this:

66.249.64.199 - - [12/Dec/2021:16:37:53 -0500] 861050 "GET /t/21/09/no_vax_work.html HTTP/1.1" 200 649 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The 66.249... is the IP address of the client (in this case, the GoogleBot) making the request. More specifically, it is an internet protocol version 4, 32 bit address. There are also IPv6 128 bit addresses. Generally, publishing an IP address is one of a small number of items I would not want to publish. However, if you look up the "whois" information on that address, you'll see that it traces to Google. I don't think the GoogleBot needs its privacy.

The first '-' is a non-existence username for when you're using the Apache web server as a user manager, which is pretty much never done anymore, if it ever was done. I don't remember what the other '-' is; I'll link to the documentation below.

Then the date and 861050 is the number of microseconds after that date. I added microseconds in that it's non-standard.

"GET" refers to the HTTP GET method / command. Simple requests--without form data--are usually GETs. Then there is the URL relative to kwynn.com root (DocumentRoot), or https://kwynn.com/t/21/09/no_vax_work.html. (Actually, nothing in my log indicates http versus https.)

"200" indicates an HTTP response code of "200 OK"--the web server found the data and is returning it: no errors. 649 is the number of bytes returned with compression, usually gzip.

The next '-' means no referrer. A referrer can be a PNG request showing that the request came from the "parent" page (an internal referral), or that the Google search engine referred the user as the result of a search (external referral).

Then there is the user agent. I have shown you enough of those for you to come to your own conclusions.

Q&A

One of my apprentices asked some questions about this:

He asked about various issues around identifying humans versus bots. I link to my current bot test. Assuming it is accurate, it shows 79% self-identifying bots on my site, and that is roughly consistent with one other site I've studied.

Then you have bots that try to show as browsers in the user agent. Those can be identified a number of ways:

Usually a bot writer doesn't bother to update its fake user agent, so it shows a browser version that is way behind
Usually bots don't fetch pictures and favicons and JavaScript and other files that would load in a browser. This means they don't execute JavaScript.
Usually they do more fetches in a second than a human could possibly do.
Many bots only fetch the home page.
Some of them keep the same IP address and change the user agent with every fetch. I think some of them do the opposite to some degree, but that's a much fuzzier memory.

To answer some other questions of my apprentice, a bot may fetch any page on the site. They may be keeping records of previous fetches, so they may "know" large portions of the site and can fetch anything at any time. As for IP addresses, there is little reason to assume they will be in the same range from fetch to fetch. Bots can use the TOR network or other VPNs. As for a bot being less inclined to see new content, that's not true. Again, if they're keeping records, they may be constantly looking for new content.

As for why bots read the site, lots of reasons. The GoogleBot and its bretheren are updating their search indexes. Or they are compiling a list of their masters' enemies and only updating the ADL's records. Almost all of the known bot user agents list domains and such. You could quickly know why they do what they do better than I do. Just like Google, I know that they offer various services to business. Then there is academic research. Hacking is a reason. Some of the queries I can tell are hack attempts. I'm sure I'm missing some.

all versions of my user agent output:

other notes

The Apache log is often at /var/log/apache2/access.log

My .conf lines to generate the microtime log are

LogFormat "%h %l %u %t %{usec_frac}t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" micro
CustomLog ${APACHE_LOG_DIR}/access.log micro

My access.log dated 2021/10/23 - 2021/12/21 is 26MB.

Sometimes I list specific versions of code rather than the repository because I may rearrange the repo. You may have to back out of that version and go back to the main tree to find the current version.