WebLog classes

Web logfile parsing and manipulation in Python

Download the WebLog-x.xx.tgz archive for the full set.

WebLog Classes - Python Logfile Analysis Toolkit
------------------------------------------------
Version 1.0

(c) 1998 Mark Nottingham
<mnot@pobox.com> - bug reports, questions, comments

This software may be freely distributed, modified and used,
provided that this copyright notice remain intact.
THIS SOFTWARE IS PROVIDED 'AS IS' WITHOUT WARRANTY OF ANY KIND.

Thanks to Ben Golding and Jeremy Hylton for their advice.

If you use the classes in an interesting or large application, please drop me
a line!

Introduction
------------
WebLog is a group of Python modules containing several class definitions that
are useful for parsing and manipulating common Web and Web proxy logfile 
formats. 

WebLog is reasonably fast, considering that it's written in a scripting
language. The parsing modules are especially well optimised; for example,
the combined parser processes about 2500 lines a second on a Pentium 233
running Unix.


Contents
--------
The modules fall into two types: parsing and postprocessing. To use the
classes, first instantiate a parsing class, then stack postprocessing classes
on top of the resulting instance (see the sketch after the lists below).

Parsing Modules:
common - Common (NCSA) Web log parser.
combined - Combined/extended Web log parser (adds referer and agent).
squid - Squid Web Proxy Cache log parsers (access.log, store.log v1.1).
multiple - combines log files of the same content from different servers.

Postprocessing Modules:
url - parses url and referer (if available) into their components.
query - parses queries into dictionaries. *
clean - normalises attributes of Web Log for more accurate analysis. *
resolve - resolves client address to host and/or IP.
referer - determines type of hit: local, offsite, manual, or file. *
limit - limit output to certain domains, files, directories or times. *

* requires use of url.Parse first
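
For instance, a minimal sketch of this stacking (the constructor signatures
shown here are my assumptions, not documented interfaces; check the module
comments before relying on them):

import sys
from weblog import combined, url

log = combined.Parser(sys.stdin)    # a parsing class always comes first
log = url.Parse(log)                # stack a postprocessor on top of it
while log.getlogent():              # entries pass through the whole stack
    print log.url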

The squid parsing module contains two classes: AccessParser (for
access.log) and StoreParser (for store.log). If you have full_mime_hdrs set
in squid.conf, make sure to set the corresponding attribute on AccessParser;
be aware, however, that using it will appreciably slow down analysis.
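
As a hedged illustration (mime_hdrs is a hypothetical attribute name, not
the documented one; look in the squid module for the attribute that
corresponds to full_mime_hdrs):

import sys
from weblog import squid

log = squid.AccessParser(sys.stdin)
log.mime_hdrs = 1                   # hypothetical flag name; set only if
                                    # squid.conf has full_mime_hdrs on
while log.getlogent():
    print log.url                   # attribute names assumed to follow the
                                    # other parsers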


Installation
------------

To install the modules, put the weblog directory either in the same directory
as your application, or in the site-packages directory. If you use
site-packages, remember to include a weblog.pth file in its top level; for
instance:

% mkdir /usr/local/lib/python1.5/site-packages   # if it isn't there
% mv weblog /usr/local/lib/python1.5/site-packages
% touch /usr/local/lib/python1.5/site-packages/weblog.pth

See the site.py module for more details. After doing this, the modules can
be imported in several ways, such as:

>>> import weblog                      # referenced like: weblog.common.Parser
>>> from weblog import common          # referenced like: common.Parser
>>> from weblog.common import Parser   # referenced like: Parser


Use
---
A parsing class must always be used first, and only once; postprocessing
classes may then be used on the resulting instance, if desired.

All of the classes define a getlogent() method, which makes the next log
entry available through the instance's attributes. It returns 0 when there
are no more lines to process.

For full details of the classes and their interfaces, read the comments in
the individual modules, as well as their __doc__ strings. Note that several
of the postprocessing classes have specific requirements for their input.
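
For example, from the interactive interpreter (this prints whatever
documentation the module and class define):

>>> from weblog import url
>>> print url.__doc__          # module documentation
>>> print url.Parse.__doc__    # class documentation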


Examples
--------
Using a WebLog class can be as easy as this script, which prints how many
hits each page on your site gets:

import sys
from weblog import common

log = common.Parser(sys.stdin)      # parse Common (NCSA) format from stdin
hits = {}
while log.getlogent():              # getlogent() returns 0 at end of log
	hits[log.url] = hits.get(log.url, 0) + 1
for (page, hit_num) in hits.items():
	print "%s %s" % (hit_num, page)

Several moderately more complex demo scripts come with the WebLog package
(in the EXAMPLES/ directory):

bad_passwords.py - identify bad HTTP authentication attempts.
referers.py - shows what referers go into your pages, by page and referer.
search_terms.py - shows what search terms are used to reach your pages on 
                  popular search engines.
squid_users.py - shows traffic through a cache by user and site.
log_watch.py - watches a logfile as it grows (like 'tail -f').

The best way to learn to use the classes is to pick through the examples, as
well as the test() functions of each of the modules.