COP 2344 (Shell Scripting) Project #6
Process Apache Web Server Access Log

 

Due: by the start of class on the date shown on the syllabus

Description:

In this project we will learn how to extract useful information from log files.  The Apache web server maintains an access log that records every request made to that website.  Each entry includes some of the HTTP 1.1 request headers.  Among them is the Referer header: the exact URL of the page whose link the user followed to find your site.

Users often reach our websites through the Google search engine.  When a user clicks a Google result to visit our site, the referring URL is the Google results page, which carries the search phrase used as part of the URL (in the query string).  It can be very useful to know what search phrase was used to find a website.

In this project you will create a pipeline of filter commands to extract the Google search phrase from an Apache access log.  You can use any filter commands you wish.

The data is in the file /var/log/httpd/access.log (on Linux systems), with one log entry per line.

Since this file is restricted to root access on YborStudent, a readable copy is available (on YborStudent) at ~wpollock/access.log.  Use this file for your project.

Requirements:

Create a one-liner (a shell pipeline or grouped command) that shows the Google search phrase used, and a count of how many times each search phrase was used, in order of most to least frequent.  You must use the access.log provided.  It is up to you to decide if you wish to use case-sensitive or case-insensitive matching of search terms.  (My model solution used case-insensitive.)

Additional Notes:

Examine the log file.  First, develop a command that shows only the log entries (lines) with Google searches.  Such lines contain “google.” and then “/search” in the URL before the query string starts.  Be careful!  Some lines in the log contain the words google and search, but are not the lines you want.  Make sure your search uses the terms shown, including the period and the slash.
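As a rough sketch of this first step (one possible approach, not the model solution), the filter below keeps only lines in which “google.” is followed by “/search” with no quote mark or question mark in between.  The function name google_hits and the second sample entry are made up for illustration; the first sample entry is adapted from the example in this handout, with the user agent shortened.

```shell
# Keep only entries whose URL has "google." and then "/search"
# before any query string ("?") or closing quote mark.
# Reads the named files, or standard input if none are given.
google_hits() {
    grep -i 'google\.[^"?]*/search' "$@"
}

# Demo: the first entry should be printed; the second (a hypothetical
# decoy that mentions both "google" and "search") should not.
google_hits <<'EOF'
72.158.245.66 - - [24/Feb/2007:08:15:49 -0500] "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267 "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros" "Mozilla/4.0 (compatible; MSIE 6.0)"
10.0.0.1 - - [24/Feb/2007:08:16:02 -0500] "GET /search.html HTTP/1.1" 200 1234 "http://example.com/links?site=google" "Mozilla/4.0"
EOF
```

On the real data you would run something like “google_hits ~wpollock/access.log | wc -l” to see how many entries survive the filter.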

Next you need to extract the one field of interest from these lines.  Examine a typical line from the access log (note this is a single line; wrapped lines show as “➥”):

72.158.245.66 - - [24/Feb/2007:08:15:49 -0500]
➥ "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267
➥ "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros"
➥ "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1;
➥ .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

Notice which field contains the search URL, and notice how the search terms (“gustavo matamoros”) appear.
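One way to isolate that field, sketched here under the assumption that the log uses the layout shown above and that no quoted field contains an embedded quote mark, is to split each line on double quotes.  The referring URL is then the fourth piece.  The function name referrer_field is hypothetical, and the user agent in the sample is shortened.

```shell
# Split each entry on double-quote marks and print the 4th piece:
#   $1 = host, date, etc.   $2 = request line
#   $3 = status and size    $4 = referring URL
referrer_field() {
    awk -F'"' '{ print $4 }' "$@"
}

# Demo on the example entry (joined back into a single line).
referrer_field <<'EOF'
72.158.245.66 - - [24/Feb/2007:08:15:49 -0500] "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267 "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros" "Mozilla/4.0 (compatible; MSIE 6.0)"
EOF
```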

It helps to know something about URLs.  In the example above, the URL is this part of the log entry:

  http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros

Note it is surrounded with double quote marks in the log entry.  A URL has many parts (most are optional and may be missing):
The protocol part is “http://”,
the host part is “www.google.com”,
the path part is “/search”, and
the query string is everything after the question mark (“?”), or “tab=vw&hl=en&q=gustavo%20matamoros”.
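
To make the parts concrete, here is a small sketch that pulls the example URL apart with the shell's parameter expansion.  It assumes a URL that has all four parts; the variable names are chosen only for illustration.

```shell
url='http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros'

protocol=${url%%://*}://       # text before "://", with "://" put back
rest=${url#*://}               # everything after "://"
host=${rest%%/*}               # up to (not including) the first slash
path_query=/${rest#*/}         # from the first slash onward
path=${path_query%%\?*}        # up to (not including) the question mark
query=${url#*\?}               # everything after the question mark

printf 'protocol: %s\n' "$protocol"
printf 'host:     %s\n' "$host"
printf 'path:     %s\n' "$path"
printf 'query:    %s\n' "$query"
```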

The search terms the user entered into Google are the part of the query string after “q=”, up to the end of the URL (marked by the close quote mark) or the next part of the query string (marked by an ampersand): “gustavo%20matamoros”.

Note this string is encoded; the search terms are “gustavo matamoros”.

After extracting this field, you need to extract the search phrase.  (If you extract the search phrase from the whole line, you might get the wrong data; it would depend on the rest of that line.  In this project, it is “good enough” to work on the whole line, without first extracting the correct field.)

To extract the search terms (words), it is helpful to know that this data is URL encoded (also called “percent encoding”).  Briefly, the data passed in a URL is called a query string; it follows a question mark.  Data items are separated by ampersands (“&”), and each item is of the form name=value.

Google puts the search phrase in an item with the name “q”.  You need to extract the value of this item (everything after the equals sign, up to the end of the item).  Note the end of the last item is marked not by an ampersand but by the closing quote mark.
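
A minimal sketch of this extraction with sed follows; it is only one of several filters that would work.  The function name q_value is made up, and the second URL in the demo is a hypothetical example showing “q” as the first item rather than the last.

```shell
# Print the value of the "q" item: the text after "?q=" or "&q=",
# up to the next ampersand, quote mark, or end of line.
# Lines with no "q" item print nothing.
q_value() {
    sed -n 's/.*[?&]q=\([^&"]*\).*/\1/p' "$@"
}

printf '%s\n' 'http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros' | q_value
# prints: gustavo%20matamoros

printf '%s\n' 'http://www.google.com/search?q=reef+sponge&hl=en' | q_value
# prints: reef+sponge
```

Requiring “?” or “&” just before the “q” keeps items whose names merely end in “q” from matching.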

You still have to decode the result to know the search phrase.  Only a subset of ASCII is allowed in a URL.  In URL-encoded query strings, a space is translated to a plus sign (“+”), or sometimes to “%20” as in the example above.  Any other character that is not allowed is represented by a percent symbol (“%”) followed by the ASCII numeric value of the character (as a pair of hex digits).  So the search phrase “Hymie Piffl” will appear as “...q=Hymie+Piffl”, and “$10 + 7% tax” will appear as “...q=%2410+%2B+7%25+tax”.
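
For the curious, the decoding rules can be sketched in awk as shown below.  This is not required for the project, and it assumes ASCII data only; the function name urldecode is made up for illustration.  Note that plus signs are converted first, so that a decoded “%2B” is not mistaken for a space.

```shell
# Decode URL-encoded text, one item per line:
#   1. every "+" becomes a space;
#   2. every "%XX" becomes the character whose code is hex XX.
urldecode() {
    awk '{
        gsub(/\+/, " ")
        out = ""
        while (match($0, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            hex  = toupper(substr($0, RSTART + 1, 2))
            code = (index("0123456789ABCDEF", substr(hex, 1, 1)) - 1) * 16 \
                 + (index("0123456789ABCDEF", substr(hex, 2, 1)) - 1)
            out = out substr($0, 1, RSTART - 1) sprintf("%c", code)
            $0  = substr($0, RSTART + 3)
        }
        print out $0
    }'
}

printf '%s\n' 'Hymie+Piffl' | urldecode           # prints: Hymie Piffl
printf '%s\n' '%2410+%2B+7%25+tax' | urldecode    # prints: $10 + 7% tax
```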

While you can perform URL decoding fairly easily in Perl or Python, for this assignment you can take a shortcut and convert all occurrences of “percent<hex-digit><hex-digit>”, plus signs, and indeed any characters other than letters or digits, to a space.  This works because in this application we only care about the words used in the search phrase.  (Be careful to convert runs of such characters to a single space, or squeeze them later using tr; otherwise sorting and comparing will be harder, because sort and uniq count spaces!)
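
The shortcut might look something like the following sketch.  The function name normalize is hypothetical, and the lowercasing step is there only because this sketch assumes case-insensitive matching; drop it if you chose case-sensitive.

```shell
# Reduce each encoded search phrase to plain words:
#   1. replace every %XX with a space;
#   2. turn every run of non-alphanumeric characters into one space;
#   3. fold to lowercase (optional);
#   4. trim leading and trailing spaces.
normalize() {
    sed 's/%[0-9A-Fa-f][0-9A-Fa-f]/ /g' |
        tr -cs '[:alnum:]\n' ' ' |
        tr '[:upper:]' '[:lower:]' |
        sed -e 's/^ *//' -e 's/ *$//'
}

printf '%s\n' 'gustavo%20matamoros' 'Hymie+Piffl' '%2410+%2B+7%25+tax' | normalize
# prints:
#   gustavo matamoros
#   hymie piffl
#   10 7 tax
```

Notice that the last phrase loses its punctuation entirely, which is acceptable here because only the words matter.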

Finally, when you have the list of search phrases, you can do the usual “dance” to generate the most frequently used:  sort, count duplicates, and sort the result in reverse numerical order.  Remember, this project should result in a script that only shows the top ten most frequent results.
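
The “dance” itself can be sketched as below.  The function name rank is made up, and the three input phrases are only a tiny illustration (the counts are not from the real log).

```shell
# Count duplicate lines and list them from most to least frequent.
# uniq only merges adjacent duplicates, so the input is sorted first.
rank() {
    sort | uniq -c | sort -rn
}

# Demo: "hawk radio" (count 2) is listed before "reef sponge" (count 1).
printf '%s\n' 'hawk radio' 'reef sponge' 'hawk radio' | rank
```

Appending “| head -n 10” to such a pipeline limits the output to the ten most frequent phrases.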

As a hint, here are the top five results found with the model solution, and some other information:

$ accesslog.sh |head -n 5
     30 hawk radio
     18 reef sponge
     17 moving pitchers
     16 the hawk radio
     14 reef sponges

$ accesslog.sh |wc -l
214

$ accesslog.sh |awk '{sum += $1}; END {print sum}'
502

To be turned in:

A copy of your pipeline or script, and the results of running it against the log file supplied.

You can type your submission or send it as email to .  Please see your syllabus for more information about submitting projects.