Regular Expression Sample Perl Script:
Parsing Web Log Files

Written 2/2005 by Wayne Pollock, Tampa, Florida, USA.

The file /etc/cron.daily/00webalizer is a shell script run automatically each day to analyze Apache web server logs.  (To see the results, point your web browser to http://localhost/usage.)  One problem I've had is a repeated hacker attempt, apparently by some script kiddie, to trigger a buffer overflow in the web server by sending extremely long URLs.  Such URLs cause no problems for the Apache web server itself; however, some tools such as webalizer have difficulty when the log files contain very long lines.

To solve the problem, I use Perl to pre-process the web log file just before starting webalizer.  The first command looks for URLs in which a single character repeats many times.  The second looks for long runs of a short sequence of the form \x.. (a backslash, an x, and any two characters).  In both cases the bulk of the repeated URL is replaced with ...REPEAT..., while the start of the repeat and the last 50 characters of the line are kept.
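
A quick way to see what these substitutions do is to feed them a hand-made log line.  (The GET requests below are made-up test cases, not entries from my actual logs.)

   # Build a fake request containing 200 repeated A's, then trim it:
   perl -e 'print "GET /index.html?q=", "A" x 200, " HTTP/1.0 404\n"' |
      perl -pe 's/(.)\1{50,}.*(.{50})$/$1$1...REPEAT...$1$1...$2/'

   # The same idea for the \x.. form (100 copies of \x41):
   perl -e 'print "GET /", "\\x41" x 100, " HTTP/1.0 404\n"' |
      perl -pe 's/(\\x..(?:\\x..)?)\1{50,}.*(.{50})$/$1$1...REPEAT...$1$1...$2/'

In each case the output keeps the start of the line untouched, collapses the long run to a short marker such as AA...REPEAT...AA..., and preserves the last 50 characters of the line.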

The modified script that cron runs appears below.

#!/bin/sh -
# update access statistics for the web site
#
# /etc/cron.daily/00webalizer, modified by WP 8/04
# $Id: 00webalizer,v 1.1 2005/03/07 17:20:40 root Exp $

if [ -s /var/log/httpd/access_log ]
then

   # Trim long URLs with a repeating character:
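   # (keeps two copies of the repeated character plus the last 50 characters)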
   perl -pi -e 's/(.)\1{50,}.*(.{50})$/$1$1...REPEAT...$1$1...$2/' \
      /var/log/httpd/access_log

   # Trim long URLs with repeating sequences of '\x..':
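   # (the repeating unit is one or two \x.. escapes; the line's tail is kept)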
   perl -pi -e 's/(\\x..(?:\\x..)?)\1{50,}.*(.{50})$/$1$1...REPEAT...$1$1...$2/' \
      /var/log/httpd/access_log

fi

/usr/bin/webalizer -Q
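
One last tip: while experimenting with in-place edits such as these, you can ask Perl to keep a backup of the original file by giving the -i switch a suffix.  For example, on a scratch copy of the log (the path here is just an illustration):

   perl -pi.bak -e 's/(.)\1{50,}.*(.{50})$/$1$1...REPEAT...$1$1...$2/' \
      /tmp/access_log-copy

This leaves the unmodified original behind as /tmp/access_log-copy.bak.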