#!/bin/bash

# Script to parse the Apache web server's access log,
# to find Google searches and to pull out the search
# terms used to find your site.
#
# Original script written by Dave Taylor, in the February 2007
# issue of "Linux Journal", in his column "Work the Shell",
# pages 32-34.  (C) 2007 by Belltown Media.
# Adapted 2/2007 by Wayne Pollock, Tampa Florida USA.
# $Id: accesslog.sh,v 1.3 2014/04/08 05:51:55 wpollock Exp $

# The referrer string, containing the search terms, is the
# second quoted string in each line of Apache's "combined"
# access log format.  By using awk with a field separator
# of '"', the referrer becomes the 4th field ($2 is the
# request line, and $6 is the user agent).
#
# The lines we want are from Google, but may not contain
# "google.com"; sometimes "google.co.uk" or "google.com.au"
# or some such is found.  What we do know is that the
# referrer field will have google.<something>, followed by
# "/search", followed somewhere by "[?&]q=terms".
# Note, there are many query parameters ending with a "q",
# so you must be careful to extract only when the whole
# parameter is just "q", and not "aq" or something else.
# The easy way is to note that parameters follow either "?"
# or "&".
#
# Next, it was easier for me to use sed to extract the
# search terms, which follow "q=" up to an "&" or a double-quote.
# (Note how the awk finds the correct lines, but prints all
# of the line.  This is because if you just printed field 4,
# the BREs of sed would be very hard to get right: the query
# terms would end with "&", '"' (double quote), *or* the end
# of the string.  But BREs don't support "OR"!  It was easier
# to print the whole line; now the query string ends with "&"
# or a double-quote.)
#
# Another sed command does a crude URL "decoding": each "+"
# becomes a space, and each "%XX" escape is not decoded but
# simply replaced with a space.  It also removes blank lines.
# The final sed removes extra blanks at the front, end, or
# middle of the string.
#
# Finally I convert everything to lower case so that the
# counting is case-insensitive, and then sort the results
# to show the search terms by frequency, most used first:

LOG=${1:-$HOME/access.log}
printf 'parsing %s...\n' "$LOG"

awk -F\" '$4 ~/google.*\/search.*[&?]q=/ {print}' "$LOG" \
  | sed -ne 's/^.*[&?]q=\([^&"]*\)[&"].*$/\1/p' \
  | sed 's/+/ /g;s/%[[:xdigit:]]\{2\}/ /g;/^ *$/d' \
  | sed 's/^ *//;s/ *$//;s/  */ /g' \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
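
# The function below is not part of the original script and is
# never called by it.  It is only a sketch for trying the
# pipeline by hand.  It feeds three made-up log lines through
# the same stages; the addresses, dates, and URLs are invented,
# and Apache's "combined" log format is assumed.  To try it,
# paste the function into an interactive shell and then run
# "demo_search_terms".  The first two lines both reduce to
# "bash tips"; the third has only an "aq=" parameter, so awk
# should skip it.
demo_search_terms() {
  # Single quotes keep the shell from expanding anything in
  # the sample data.
  printf '%s\n' \
    '192.0.2.1 - - [08/Apr/2014:05:51:55 -0400] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?hl=en&q=Bash+Tips&aq=f" "Mozilla/5.0"' \
    '192.0.2.2 - - [08/Apr/2014:05:52:10 -0400] "GET / HTTP/1.1" 200 512 "http://www.google.co.uk/search?q=bash%20tips" "Mozilla/5.0"' \
    '192.0.2.3 - - [08/Apr/2014:05:53:00 -0400] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?aq=f&hl=en" "Mozilla/5.0"' \
  | awk -F\" '$4 ~ /google.*\/search.*[&?]q=/ {print}' \
  | sed -ne 's/^.*[&?]q=\([^&"]*\)[&"].*$/\1/p' \
  | sed 's/+/ /g;s/%[[:xdigit:]]\{2\}/ /g;/^ *$/d' \
  | sed 's/^ *//;s/ *$//;s/  */ /g' \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}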