A Squid refresh_pattern analysis program
A paper and slides about Squij, presented as a work in progress (WIP) at the International Web Caching Workshop 1999, are available.
If you have a platform request for a compiled binary, please tell me. If you have Python or can install it, please take pity on my link and get the Python version. Thanks.
squij-x.xx.tar.gz - Python source (requires Python 1.5 or greater)
squij-x.xx-[platform].tar.gz - platform-specific binaries.
The source distribution contains all needed modules.
CHANGES - version change listing
squij-0.7.tar.gz - Python source for Squij
Squij 0.7 (c) 1998 Mark Nottingham
Licensed under the GPL; see COPYING

INTRODUCTION
------------

Squij is a program that looks at Web proxy logfiles in Squid format and
gives you information about how objects in the cache are accessed.
Specifically, it gives the following statistics for each of the object
types defined in the refresh_pattern section of squid.conf:

  * average service time, in seconds
  * raw hit rate (in hits and bytes sent to clients)
  * fresh vs. stale objects for hits
  * modified vs. unmodified objects for stale objects
  * total hits (in hits, and bytes sent to clients)

This information can be used to tune the refresh settings for your cache.
For more details, see INTERPRETING THE OUTPUT.

Squij can also give these statistics for a single origin server (web
site), so that you can tell how cacheable it is.

These statistics are for TCP traffic only, and do not include UDP.

Squij is still experimental, and more features are planned. Feel free to
contact me if you have any suggestions.

REQUIREMENTS
------------

  * A Web cache running Squid (version 1.x or 2.x)
  * Python 1.5 or better installed on the same machine
  * Full logs, not in httpd format

Other caches (such as Network Appliance NetCache or Cisco CacheEngine) do
supply log files in Squid format. However, they do not always use the same
tokens to communicate the type of hit, and of course do not use a
Squid-style configuration file. Because of this, they are not supported at
this time, although it would probably be trivial to adapt Squij for them.

INSTALLATION
------------

To use Squij, you must have Python 1.5 or better. See
http://www.python.org/

Squij currently must be run on the same machine as Squid.

To install Squij, unpack the tarball into any directory. That's it.

If you like, you can place the libraries (Acct.py, Conf.py) in your
site-packages directory; see site.py in the Python library directory
(usually /usr/local/lib/python-1.5/) for more details.
Optional step: If you want to place the weblog libraries in site-packages,
move them as an entire directory and touch a .pth file for them in
site-packages; i.e.,

  % touch /usr/local/lib/python-1.5/site-packages/weblog.pth

or follow the instructions in the weblog distribution at
http://www.pobox.com/~mnot/script/python/WebLog/

USE
---

Squij should be run by hand, like this:

  ./squij -c [path-to-squid.conf]

e.g.,

  ./squij -c /opt/local/squid/etc/squid.conf

If you do not specify a path to squid.conf, Squij will assume it's at
/usr/local/squid/etc/squid.conf

The options to squij are:

  -a  archive: use access.log.0 instead of access.log
  -c [location of squid configuration file]
  -h  help message
  -i  take logfile from stdin
  -n  Squid version 1.1.x config format (default is version 2)
  -o [origin server]  limit analysis to requests on a single server
  -s [start time in UTC, e.g., 905954667]

If you're still using Squid 1.1.x, you'll need to use the -n option.

When run, Squij will parse the configuration file and determine where your
logfile is, as well as which refresh_pattern lines you have declared.
After analysing the access logfile, it will print the results on STDOUT.

WARNING FOR SQUID 1.1.x USERS
-----------------------------

Because of the nature of the Conf module, refresh_pattern and
refresh_pattern/i lines in squid.conf are parsed separately, and then
recombined in Squij. When this happens, lines within each configuration
type are kept in the same order as in the file, but the two types are kept
separate, with the case-insensitive section coming first.

Therefore, the ordering of refresh_pattern and refresh_pattern/i in
squid.conf must be carefully considered. In most cases, this should not
cause any problems, but there may be instances where it interferes with
the results. If this causes insurmountable problems for you, upgrade to
Squid 2.x.

INTERPRETING THE OUTPUT
-----------------------

Output is currently presented as a simple ASCII table.
There is one row for each refresh_pattern found in your squid.conf file.
In order, the columns are:

  REGEX            - the pattern. 'i' is appended if it is
                     case-insensitive.
  AVE SVC TIME     - time, in seconds, that it takes to send these objects
                     to the client. This includes objects satisfied from
                     the cache as well as from the network.
  HIT/BYTE RATE    - hit rate, in hits and bytes, for the pattern.
  FRESH/STALE      - ratio of fresh hits to stale hits for the pattern.
  UNMOD/MOD        - ratio of stale hits that were unmodified on the origin
                     server to those that were modified.
  TOTAL HITS/BYTES - total number of hits and bytes seen for the pattern.

The last row gives overall statistics for each column, for all content.

  * Note that byte hit rates are those sent to the client; client IMS hits
    may cause this to be inaccurate.
  * If 0 is on either side of one of the ratios, it means that there was
    no traffic seen for that item.

So, how do you use this?

Hit rate and total hits are merely metrics for how much a pattern is used,
and how effectively the matching objects can be cached. They allow you to
determine which patterns are worth working with, and which ones may need
to be split into separate patterns.

Fresh/stale tells you how the refresh parameters are performing; a higher
fresh ratio means that more requests are being satisfied directly from the
cache.

Unmod/mod compares the stale hits that were checked (with an IMS) on the
origin server and found unmodified against those found modified. If there
is a high ratio of unmodified stale hits, it may be good to raise your
refresh thresholds. On the other hand, a high number of modified hits
indicates that your thresholds are too high, and that objects are more
likely to have been modified while your cache still believes that they are
fresh.

It is a good idea to aim to keep unmod/mod at 1:1, or with a slightly
higher unmod number.
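The fresh/stale and unmod/mod tallies described above can be sketched in a
few lines of Python. This is only an illustration, not Squij's own code: it
is written for a current Python 3 rather than Python 1.5, the names tally()
and ratio() are hypothetical, the sample log lines are invented, and the
mapping of Squid 2.x result codes to "fresh", "stale unmodified" and "stale
modified" is an assumption that may not match the tokens Squij really uses.

```python
import re
from math import gcd

# ASSUMPTION: how Squid 2.x result codes map onto the categories above.
FRESH = {"TCP_HIT", "TCP_MEM_HIT", "TCP_IMS_HIT"}
STALE_UNMODIFIED = {"TCP_REFRESH_HIT"}   # revalidated, origin said unmodified
STALE_MODIFIED = {"TCP_REFRESH_MISS"}    # revalidated, origin sent a new copy


def ratio(a, b):
    """Reduce a:b to lowest terms; 0:0 means no traffic was seen."""
    d = gcd(a, b) or 1
    return "%d:%d" % (a // d, b // d)


def tally(lines, patterns):
    """Count hit types per pattern from native-format access.log lines.

    patterns is a list of (regex, case_insensitive) pairs in squid.conf
    order; the first pattern that matches a URL gets the request.
    """
    compiled = [(p, re.compile(p, re.I if ci else 0)) for p, ci in patterns]
    stats = {p: {"fresh": 0, "unmod": 0, "mod": 0, "total": 0}
             for p, _ in compiled}
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue                      # skip short or damaged lines
        code = fields[3].split("/")[0]    # e.g. TCP_HIT from TCP_HIT/200
        url = fields[6]
        for p, rx in compiled:
            if rx.search(url):
                s = stats[p]
                s["total"] += 1
                if code in FRESH:
                    s["fresh"] += 1
                elif code in STALE_UNMODIFIED:
                    s["unmod"] += 1
                elif code in STALE_MODIFIED:
                    s["mod"] += 1
                break                     # first match wins
    return stats


# Invented sample data, for illustration only.
SAMPLE_PATTERNS = [(r"\.gif$", True), (r"\.html$", False), (r".", False)]
SAMPLE_LOG = [
    "905954667.123 120 10.0.0.1 TCP_HIT/200 1024 GET http://example.com/a.gif - NONE/- image/gif",
    "905954668.456 340 10.0.0.1 TCP_REFRESH_HIT/200 2048 GET http://example.com/b.GIF - DIRECT/example.com image/gif",
    "905954669.789 900 10.0.0.2 TCP_REFRESH_MISS/200 4096 GET http://example.com/index.html - DIRECT/example.com text/html",
    "905954670.012 800 10.0.0.2 TCP_MISS/200 512 GET http://example.com/other - DIRECT/example.com text/plain",
]

if __name__ == "__main__":
    for pattern, s in tally(SAMPLE_LOG, SAMPLE_PATTERNS).items():
        stale = s["unmod"] + s["mod"]
        print(pattern, ratio(s["fresh"], stale), ratio(s["unmod"], s["mod"]))
```

The sketch reduces ratios exactly; the ratios that Squij prints may be
rounded differently.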
For example, consider this output:

  regex      hit rate      fresh/stale  unmod/mod  total
  ------------------------------------------------------------------
  \.gif$     25% ( 14%)    5:2          1:1        19357 (  48709k)
  \.jpg$     16% ( 19%)    15:2         3:1         1990 (  24105k)
  \.htm$     29% ( 29%)    1:1          3:4         1110 (   9311k)
  \.html$    21% ( 24%)    1:2          2:11        4099 (  27138k)
  \.exe$      9% ( 12%)    1:0          0:0           19 (  42313k)
  \/$        48% ( 61%)    2:15         1:5         3407 (  35211k)
  .           7% (  2%)    1:1          1:3         6049 ( 206117k)
  OVERALL    24% ( 14%)    1:1          1:1        36877 ( 355795k)

.gif traffic has very good statistics; the hit rate, total traffic and
fresh ratio are all high, and unmod/mod is 1:1, which is about where we
want it.

.jpg traffic is also good, but could possibly benefit from even higher
refresh thresholds.

.htm and .html traffic is fresh fairly often, but is usually modified when
it becomes stale; this indicates that we should consider scaling back
those patterns.

All cache hits to .exe objects were fresh.

The default pattern ('.') is being used a fair amount; it may be
worthwhile to try more precise patterns.

  * The output of squij is still experimental, and unproven. Currently,
    UDP (inter-cache) traffic is NOT included; only HTTP (client) traffic
    is measured.

ABOUT THE SOFTWARE
------------------

Squij uses the following software:

  - Python 1.5 - The best scripting language on the planet
    http://www.python.org/
  - WebLog modules - Web logfile parsing modules
    http://www.pobox.com/~mnot/script/python/WebLog/
  - Conf module - Configuration file parsing class
    http://www.pobox.com/~mnot/script/python/Conf/

CONTACT
-------

Please contact me with any bugs, suggestions or questions.
http://www.pobox.com/~mnot/