Squij

A Squid refresh_pattern analysis program

A paper and slides about Squij, presented as a WIP at the International Web Caching Workshop 1999, are available.

If you have Python or can install it, please take pity on my link and get the Python version. Thanks.

squij-x.xx.tar.gz - Python source (requires Python 1.5 or greater)
squij-x.xx-[platform].tar.gz - platform-specific binaries.

The source distribution contains all needed modules.

Name               Description
CHANGES            version change listing
squij-0.7.tar.gz   Python source for squij
Squij 0.7
(c) 1998 Mark Nottingham 
Licensed under the GPL; see COPYING


INTRODUCTION
------------
Squij is a program that reads Web proxy logfiles in Squid format
and reports how objects in the cache are accessed.
Specifically, it gives the following statistics for each of the
object types defined by the refresh_pattern lines in squid.conf:

* average service time, in seconds
* raw hit rate (in hits and bytes sent to clients)
* fresh vs. stale objects for hits
* modified vs. unmodified objects for stale objects
* total hits (in hits, and bytes sent to clients)

This information can be used to tune the refresh settings for your cache.
For more details, see INTERPRETING THE OUTPUT.

Squij can also give these statistics for a single origin server (web site), 
so that you can tell how cacheable it is.

These statistics are for TCP traffic only, and do not include UDP.

Squij is still experimental, and more features are planned. Feel free to 
contact me if you have any suggestions.


REQUIREMENTS
------------
* A Web cache running Squid (version 1.x or 2.x)
* Python 1.5 or better installed on the same machine
* Full logs, not in httpd format

Other caches (such as Network Appliance NetCache or Cisco CacheEngine) do
supply log files in Squid format. However, they do not always use the same
tokens to communicate the type of hit, and of course do not use a Squid-style
configuration file. Because of this, they are not supported at this time, 
although it would probably be trivial to adapt Squij for them.


INSTALLATION
------------
To use Squij, you must have Python 1.5 or better. See http://www.python.org/

Squij must currently be run on the same machine as Squid.

To install Squij, unpack the tarball into any directory. That's it. If 
you like, you can place the libraries (Acct.py, Conf.py)
in your site-packages directory; see site.py in the Python library directory
(usually /usr/local/lib/python-1.5/) for more details.


Optional step:
If you want to place the weblog libraries in site-packages, move them as
an entire directory and touch a .pth file for them in site-packages; i.e.,
% touch /usr/local/lib/python-1.5/site-packages/weblog.pth

or follow the instructions in the weblog distribution at
http://www.pobox.com/~mnot/script/python/WebLog/


USE
---
Squij should be run by hand, like this:
./squij -c [path-to-squid.conf]

e.g.,
./squij -c /opt/local/squid/etc/squid.conf

If you do not specify a path to squid.conf, Squij will assume it's at
/usr/local/squid/etc/squid.conf

The options to squij are:
-a archive: use access.log.0 instead of access.log
-c [location of squid configuration file]
-h help message
-i take logfile from stdin
-n Squid version 1.1.x config format (default is version 2)
-o [origin server] limit analysis to requests on a single server
-s [start time, in seconds since the Unix epoch (UTC), e.g., 905954667]

If you're still using squid 1.1.x, you'll need to use the -n option.
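
The example value for -s looks like a Unix timestamp (seconds since
1970-01-01 00:00:00 UTC), which is also how Squid stamps its native log
entries. Assuming that reading is right, the sketch below converts between a
UTC date and such a value. The helper names are made up for illustration,
and the code is written for a current Python 3, not the Python 1.5 that
Squij itself targets.

```python
import calendar
import time


def utc_to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return seconds since 1970-01-01 00:00:00 UTC for a UTC date and time."""
    # timegm() interprets the tuple as UTC, unlike time.mktime() (local time).
    return calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))


def epoch_to_utc(seconds):
    """Return a readable UTC timestamp for an epoch value."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(seconds))


print(utc_to_epoch(1998, 9, 16, 14, 4, 27))  # value to pass to -s
print(epoch_to_utc(905954667))               # the example value, decoded
```

The first number printed can be passed directly as the argument to -s.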

When run, Squij parses the configuration file to determine where your
logfile is and which refresh_pattern lines you have declared. After
analysing the access logfile, it prints the results to STDOUT.
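
To make the configuration step concrete, here is a minimal sketch of pulling
the logfile location and the refresh_pattern lines out of a Squid 2.x style
squid.conf. It is not Squij's Conf module; the function name, the returned
dictionary keys and the sample configuration are all invented for
illustration, and it assumes the usual "refresh_pattern [-i] regex min
percent max" layout and the cache_access_log directive. It is written for a
current Python 3.

```python
def parse_squid_conf(conf_text):
    """Return (access_log_path, refresh_patterns) found in squid.conf text."""
    access_log = None
    patterns = []
    for line in conf_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        if fields[0] == "cache_access_log" and len(fields) > 1:
            access_log = fields[1]
        elif fields[0] == "refresh_pattern":
            args = fields[1:]
            case_insensitive = bool(args) and args[0] == "-i"
            if case_insensitive:
                args = args[1:]
            if len(args) < 4:
                continue  # not enough fields; ignore the line
            regex, min_age, percent, max_age = args[:4]
            patterns.append({
                "regex": regex,
                "case_insensitive": case_insensitive,
                "min": int(min_age),
                "percent": int(percent.rstrip("%")),
                "max": int(max_age),
            })
    return access_log, patterns


# Hypothetical configuration fragment, for illustration only.
SAMPLE = r"""
# comment lines and unrelated directives are skipped
cache_access_log /usr/local/squid/logs/access.log
refresh_pattern -i \.gif$ 1440 50% 10080
refresh_pattern . 0 20% 4320
"""

log_path, patterns = parse_squid_conf(SAMPLE)
print(log_path)
for p in patterns:
    print(p["regex"], "i" if p["case_insensitive"] else "")
```

Patterns are kept in file order, since the order of refresh_pattern lines
decides which one applies to a given URL.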


WARNING FOR SQUID 1.1.x USERS
-----------------------------
Because of the nature of the Conf module, refresh_pattern and
refresh_pattern/i lines in squid.conf are parsed separately, and then
recombined in Squij. When this happens, lines within each configuration
type are kept in the same order as in the file, but the two types are kept
apart, with the case-insensitive section coming first.

Therefore, the ordering of refresh_pattern and refresh_pattern/i in
squid.conf must be carefully considered. In most cases, this should not cause
any problems, but there may be instances where it interferes with the results.

If this causes insurmountable problems for you, upgrade to Squid 2.x.
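
The effect of the regrouping can be seen in a small sketch. Squid applies the
first refresh_pattern that matches a URL, so moving the case-insensitive
patterns to the front can change which pattern a request is counted under.
The code below only imitates the reordering described above; the function
names, patterns and URL are made up, and it is written for a current
Python 3.

```python
import re


def regroup(patterns):
    """Case-insensitive patterns first; each group keeps its file order."""
    insensitive = [p for p in patterns if p[1]]
    sensitive = [p for p in patterns if not p[1]]
    return insensitive + sensitive


def first_match(patterns, url):
    """Return the regex of the first pattern that matches url, or None."""
    for regex, case_insensitive in patterns:
        flags = re.IGNORECASE if case_insensitive else 0
        if re.search(regex, url, flags):
            return regex
    return None


# (regex, case_insensitive) pairs in the order they appear in squid.conf.
conf_order = [
    (r"/private/", False),  # refresh_pattern
    (r"\.gif$", True),      # refresh_pattern/i
    (r".", False),          # refresh_pattern (default)
]

url = "http://www.example.com/private/logo.GIF"
print(first_match(conf_order, url))           # pattern chosen in file order
print(first_match(regroup(conf_order), url))  # pattern chosen after regrouping
```

Here the request matches /private/ in file order, but \.gif$ once the
case-insensitive group has been moved to the front.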


INTERPRETING THE OUTPUT
-----------------------
Output is currently presented as a simple ASCII table. There is one row
for each refresh_pattern found in your squid.conf file.

In order, the columns are:

REGEX - the pattern. 'i' is appended if it is case-insensitive.
AVE SVC TIME - average time, in seconds, that it takes to send these objects
to the client. This includes objects satisfied from the cache as well
as from the network.
HIT/BYTE RATE - hit rate, in hits and bytes, for that object.
FRESH/STALE - ratio of fresh hits vs. stale hits for the pattern.
UNMOD/MOD - ratio of stale hits that were unmodified on the origin server,
against those that were modified.
TOTAL HITS/BYTES - total number of hits and bytes seen for the pattern.

The last row gives overall statistics for each column, across all content.

* note that byte hit rates are based on bytes sent to the client; client IMS
hits may cause this to be inaccurate.

* if 0 appears on either side of one of the ratios, it means that no traffic
was seen for that item.
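
As a rough illustration of where such numbers can come from, the sketch below
tallies a few lines in Squid's native access.log format (timestamp, elapsed
milliseconds, client, result code/status, bytes, method, URL, ...). It is not
Squij's code: the grouping of result codes into fresh, stale-unmodified and
stale-modified is an assumption, the log lines are invented, and the names
are hypothetical. It is written for a current Python 3.

```python
# Assumed grouping of Squid result codes; Squij's own grouping may differ.
FRESH = {"TCP_HIT", "TCP_MEM_HIT", "TCP_IMS_HIT"}
STALE_UNMODIFIED = {"TCP_REFRESH_HIT"}
STALE_MODIFIED = {"TCP_REFRESH_MISS"}


def tally(log_lines):
    """Count requests, bytes, elapsed time and hit types for TCP entries."""
    stats = {"requests": 0, "bytes": 0, "elapsed_ms": 0,
             "fresh": 0, "stale_unmod": 0, "stale_mod": 0}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # not a complete native-format entry
        code = fields[3].split("/")[0]
        if not code.startswith("TCP_"):
            continue  # UDP (inter-cache) traffic is not counted
        stats["requests"] += 1
        stats["elapsed_ms"] += int(fields[1])
        stats["bytes"] += int(fields[4])
        if code in FRESH:
            stats["fresh"] += 1
        elif code in STALE_UNMODIFIED:
            stats["stale_unmod"] += 1
        elif code in STALE_MODIFIED:
            stats["stale_mod"] += 1
    return stats


# Invented log entries, for illustration only.
SAMPLE_LOG = [
    "905954667.123 120 10.0.0.1 TCP_HIT/200 2048 GET http://www.example.com/a.gif - NONE/- image/gif",
    "905954668.456 380 10.0.0.1 TCP_REFRESH_HIT/200 1024 GET http://www.example.com/b.gif - DIRECT/www.example.com image/gif",
    "905954669.789 900 10.0.0.2 TCP_REFRESH_MISS/200 4096 GET http://www.example.com/c.gif - DIRECT/www.example.com image/gif",
    "905954670.012 5 10.0.0.3 UDP_HIT/000 60 ICP_QUERY http://www.example.com/a.gif - NONE/- -",
]

stats = tally(SAMPLE_LOG)
stale = stats["stale_unmod"] + stats["stale_mod"]
print("ave svc time (s):", stats["elapsed_ms"] / 1000.0 / stats["requests"])
print("fresh/stale: %d:%d" % (stats["fresh"], stale))
print("unmod/mod: %d:%d" % (stats["stale_unmod"], stats["stale_mod"]))
```

A per-pattern report would keep one such tally for each refresh_pattern and
add each request to the tally of the first pattern its URL matches.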

So, how do you use this? 

Hit rate and total hits are merely metrics for how much a pattern is used, 
and how effectively the matching objects can be cached. They allow you to 
determine what patterns are worth working with, and which ones may need to 
be split into separate patterns.

Fresh/stale tells you how the refresh parameters are performing; a higher
fresh ratio means that more requests are being satisfied directly from the
cache.

Unmod/mod compares how many of the stale hits that were checked (with an IMS)
against the origin server turned out to be unmodified, and how many modified.
If there is a high ratio of unmodified stale hits, it may be good to raise
your refresh thresholds. On the other hand, a high number of modified hits
indicates that your thresholds are too high, and that objects are more likely
to be modified while your cache still believes that they are fresh.

It is a good idea to aim to keep unmod/mod at 1:1 or with a slightly higher
unmod number.
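
The rule of thumb above can be written down as a small helper. This is only
a sketch: the function name is hypothetical, and the cut-off of 2 for "well
above 1:1" is an arbitrary assumption, not something Squij defines. It is
written for a current Python 3.

```python
def advise(unmod, mod, raise_factor=2):
    """Suggest a refresh threshold change from an unmod:mod ratio.

    raise_factor is an arbitrary, illustrative cut-off: thresholds are
    considered too low once unmod is at least raise_factor times mod.
    """
    if unmod == 0 and mod == 0:
        return "no stale traffic seen"
    if unmod < mod:
        return "consider lowering thresholds"
    if unmod >= raise_factor * mod:
        return "consider raising thresholds"
    return "about right"


for ratio in [(1, 1), (3, 1), (3, 4), (2, 11), (0, 0)]:
    print("%d:%d" % ratio, "->", advise(*ratio))
```

Comparing unmod against a multiple of mod, and not dividing one by the
other, avoids a division by zero when one side of the ratio is 0.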

For example:

               regex      hit rate  fresh/stale  unmod/mod       total
------------------------------------------------------------------------------
             \.gif$       25% ( 14%)     5:2     1:1       19357 (     48709k)
             \.jpg$       16% ( 19%)    15:2     3:1        1990 (     24105k)
             \.htm$       29% ( 29%)     1:1     3:4        1110 (      9311k)
            \.html$       21% ( 24%)     1:2     2:11       4099 (     27138k)
             \.exe$        9% ( 12%)     1:0     0:0          19 (     42313k)
                \/$       48% ( 61%)     2:15    1:5        3407 (     35211k)
                  .        7% (  2%)     1:1     1:3        6049 (    206117k)
             OVERALL      24% ( 14%)     1:1     1:1       36877 (    355795k)

.gif traffic has very good statistics; the hit rate, total traffic and fresh
ratio are all high, and unmod/mod is 1:1, which is about where we want it.

.jpg traffic is also good, but could possibly benefit from even higher 
refresh thresholds.

.htm and .html traffic is fresh fairly often, but is usually modified when
it becomes stale; this indicates that we should consider scaling back those
patterns.

All cache hits to .exe objects were fresh.

The default pattern ('.') is being used a fair amount; it may be worthwhile
to try more precise patterns. 

* The output of squij is still experimental, and unproven. Currently, UDP 
(inter-cache) traffic is NOT included; only HTTP (client) traffic is measured.




ABOUT THE SOFTWARE
------------------
Squij uses the following software:

- Python 1.5  - The best scripting language on the planet
  http://www.python.org/
- WebLog modules - Web logfile parsing modules
  http://www.pobox.com/~mnot/script/python/WebLog/
- Conf module - Configuration file parsing class
  http://www.pobox.com/~mnot/script/python/Conf/


CONTACT
-------
Please contact me with any bugs, suggestions or questions. 

http://www.pobox.com/~mnot/