Import historical data from Apache logs into AWStats

One of my clients had a problem where the last 6 months of data was not in Google Analytics. Upon investigating it turned out that for some reason the WordPress Google Analytics plugin was not active. I could not determine why it was not active when I am sure I set it up in the past.
I had all the Apache logs for the period in question so it seemed a simple idea to put the data into something useful that would show charts to the client. AWStats is perfect for that. In fact I used to use it long ago before Google Analytics was available but I had forgotten about it. As with all good open source software, the project is still there and ticking along.
Configuring AWStats turned out to be a bit tricky. By default, Debian sets AWStats up for a single-domain host. My Apache logs are configured in the vhost_combined format, which writes one access.log file for all the virtual hosts.
The log files are rotated by logrotate and numbered access.log.1, access.log.2, access.log.3 ... access.log.10 and so on. This presents another problem: you need to get them into order, and normal alphabetical sorting does not work because there are no leading 0s in the file names.
Further, Apache was misconfigured: the virtual host field of every entry, which should have indicated which virtual host served the request, in fact showed the ServerName. Luckily the entries do include a full URL on the correct host name (in the referer field), so with a bit of grep and sed it was easy to reconstruct what the virtual host should have been.
I wrote a little bash script that takes a file name (e.g. access.log or access.log.gz) and outputs that file after parsing it to fix up the errors. Later I discovered that zcat -f will cat a file whether it is gzipped or not, which removes the need for the mycat function. You’ll see in the sed regular expression that I change the : to a space. AWStats does not like having a : between the hostname and the port, and I could find no way to make AWStats parse that correctly. The reason there are two regex replacements in the sed command is that I fixed the Apache logging of the host name prior to running this script, so it needs to handle both the old hostname and the new hostname.
I could have made the sed regex take the port number into account, but I’m only interested in port 80 anyway and didn’t see the need to spend time on getting that working.
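If other ports did matter, the rewrite could capture the port instead of hard-coding 80. This is only a minimal sketch, assuming the same two host names as above; fix_vhost is a hypothetical helper name and the sample line on port 8080 is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): rewrite a leading
# "host:port " to "correct.host.name port ", keeping whatever port was logged.
fix_vhost() {
    sed -E 's/^(old\.host\.name|correct\.host\.name):([0-9]+) /correct.host.name \2 /'
}

# Made-up sample line on port 8080 to show the port being preserved
printf '%s\n' 'old.host.name:8080 199.7.156.141 - - [16/Sep/2012:17:25:51 +1000] "GET / HTTP/1.1" 200 7108' \
    | fix_vhost
```

Anchoring the pattern to the start of the line means a host name that happens to appear later in the entry, for example in the referer, is left alone.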
Log file format:

# Actual
old.host.name:80 199.7.156.141 - - [16/Sep/2012:17:25:51 +1000] "GET /wp-content/themes/grip/style.css HTTP/1.1" 200 7108 "http://correct.host.name/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; BTRS126493; EasyBits GO v1.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; eSobiSubscriber 2.0.4.16; InfoPath.2)"
old.host.name:80 199.7.156.141 - - [16/Sep/2012:17:25:51 +1000] "GET /wp-content/themes/grip/stylesheet/nivo-slider/nivo-slider.css HTTP/1.1" 200 968 "http://correct.host.name/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; BTRS126493; EasyBits GO v1.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; eSobiSubscriber 2.0.4.16; InfoPath.2)"
# Required for importing to AWStats
correct.host.name 80 199.7.156.141 - - [16/Sep/2012:17:25:51 +1000] "GET /wp-content/themes/grip/style.css HTTP/1.1" 200 7108 "http://correct.host.name/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; BTRS126493; EasyBits GO v1.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; eSobiSubscriber 2.0.4.16; InfoPath.2)"
correct.host.name 80 199.7.156.141 - - [16/Sep/2012:17:25:51 +1000] "GET /wp-content/themes/grip/stylesheet/nivo-slider/nivo-slider.css HTTP/1.1" 200 968 "http://correct.host.name/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; BTRS126493; EasyBits GO v1.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; eSobiSubscriber 2.0.4.16; InfoPath.2)"

catlogs.sh finds the relevant lines in a given log file, reformats them to be suitable for importing into AWStats and writes them to stdout:

#!/bin/bash
mycat() {
    local f;
    for f; do
        case $f in
            *.gz) gzip -cd "$f" ;;
            *) cat "$f" ;;
        esac;
    done;
}
mygrep() {
    # get all the lines from the log file which have accesses to correct.host.name
    # ("$1" is quoted so file names with spaces work; dots are escaped so they match literally)
    mycat "$1" | grep 'http://correct\.host\.name' | \
        sed -e 's/old\.host\.name:80/correct.host.name 80/ ; s/correct\.host\.name:80/correct.host.name 80/' # replace incorrect hostnames
}
mygrep "$1"
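Since zcat -f reads both plain and gzipped files, the script could be shortened. The following is a sketch of that idea rather than the script I actually ran; it assumes gzip's zcat is installed, and fixlog is a hypothetical function name.

```shell
#!/bin/bash
# Sketch of a shorter catlogs.sh: zcat -f copies plain files through unchanged
# and decompresses gzipped ones, so the mycat function is not needed.
fixlog() {
    zcat -f "$1" \
        | grep 'http://correct\.host\.name' \
        | sed -e 's/^old\.host\.name:80 /correct.host.name 80 /' \
              -e 's/^correct\.host\.name:80 /correct.host.name 80 /'
}

# Only process a file when one was given on the command line
if [ "$#" -gt 0 ]; then
    fixlog "$1"
fi
```

It would be called the same way as the original, for example bash catlogs.sh /var/log/apache2/access.log.2.gz.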

Then I needed to loop through all the access.log files in the Apache log directory in historical order. To do that I wrote a simple for loop on the command line.

for i in $(ls /var/log/apache2/access.log* | sort -r -n -k 3 -t '.' ) ; do sudo -u www-data /usr/lib/cgi-bin/awstats.pl -showcorrupted -showsteps -LogFile="bash /home/jason/catlogs.sh $i |" -config=/etc/awstats/awstats.correct.host.name.conf ; done;

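A nice thing with AWStats is that you can pass in a command that outputs to stdout as the log file: -LogFile="bash /home/jason/catlogs.sh $i |" (note the trailing pipe). I used sort to get the files into numerical order: -t '.' sets the field separator to a dot, -k 3 selects the third field (the rotation number) as the sort key, -n compares it numerically and -r reverses the order. The log entries need to go from oldest at the top to newest at the bottom, and the highest-numbered file is the oldest, so you have to process the files in reverse number order.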
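To see what the sort options do without touching real logs, here is a small sketch using made-up file names; order_logs is a hypothetical wrapper around the same sort options used in the loop.

```shell
#!/bin/bash
# Made-up file names standing in for the rotated logs
names='access.log
access.log.1
access.log.10.gz
access.log.2.gz
access.log.9.gz'

# -t '.' splits on dots, -k 3 picks the rotation number,
# -n compares numerically, -r puts the highest number (oldest file) first
order_logs() {
    sort -t '.' -k 3 -n -r
}

printf '%s\n' "$names" | order_logs
```

This lists access.log.10.gz first and the current access.log last. The current file has no third field, so sort treats its key as 0 and it ends up at the bottom, which is where the newest entries belong.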
Lastly, to ensure AWStats can read the Apache access logs in future, I changed the Apache vhost_combined format to:

LogFormat "%V %p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined

and I changed the AWStats log format to:

LogFormat = "%virtualname %other %host %other %logname %time1 %methodurl %code %bytesd %refererquot %uaquot"
