Follow 2.04 FAQ (version 1.1)
(c) Copyright 1998 Mark Nottingham
http://www.pobox.com/~mnot/follow2/

### Follow Frequently Asked Questions (and answers) ###

## 0.0 General information
0.1 How can I get help with Follow?
0.2 Do I have to pay for Follow?
0.3 How do I report a bug?
0.4 Will there be a new version? What features will it have?
0.5 What language is Follow written in?

## 1.0 Installation
1.1 Which distribution should I get?

## 2.0 Configuration
2.1 Why do there have to be two copies of the config file?
2.2 My site has several DNS names pointing at it; how do I configure
    for that?
2.3 How do I set up a cron job?
2.4 I can't use cron on the server. Can I still use Follow?
2.5 I'd like to tell follow-gather where the configuration file is.
2.6 I have several virtual servers on one machine; can they all use
    follow?

## 3.0 Combined Logfile Format
3.1 What is the Combined Logfile Format?
3.2 How do I set up Apache to use the Combined Format?
3.3 How do I set up Netscape servers to use the Combined Format?
3.4 How do I set up Microsoft IIS to use the Combined Format?
3.5 How do I set up other servers to use the Combined Format?

## 4.0 Common Problems
4.1 follow.cgi complains about not finding the config file. What's
    wrong?
4.2 When I view a certain date, follow.cgi complains that it can't find
    a file.
4.3 There aren't any listings for a day, but my site got hits. What
    happened?
4.4 I've set up follow.conf as a symlink, and it doesn't work.

## 5.0 Interpreting and Tuning the Output
5.1 What do the different types of hits mean?
5.2 What do the byte (page and image size) totals mean?
5.3 What are destination hits?
5.4 Follow says that a page got x hits on a day, but I checked and it
    got more.
5.5 My cgi programs have strange statistics associated with them.

## 6.0 How Follow Works
6.1 Why do two programs (follow-gather and follow.cgi) have to run?
6.2 What does Follow do with client and server errors (4xx and 5xx
    status)?
6.3 What happens when Follow detects a 'BACK' hit?
6.4 Why should I ignore Web spiders?
6.5 How does Follow deal with proxy and client caches?
6.6 How is the average page time calculated?

#########################################################################
#########################################################################

## 0.0 General information

0.1 How can I get help with Follow?

First, read the FAQ, the online help and the README completely. Then
have a look at the Web page at http://www.pobox.com/~mnot/follow2/ and
try the -h and -v options on follow-gather.

Currently, I don't support Follow; I haven't worked on it for years, so
I wouldn't know where to start.

0.2 Do I have to pay for Follow?

No. As of Follow 2.04, it's free, and distributed as Python source.

0.3 How do I report a bug?

You don't ;) If there's something critically wrong with Follow, report
it to me and I'll fix it if I have time, but development on Follow has
generally stopped. Feel free to modify Follow in any way you'd like, as
long as you respect the terms of the license.

0.4 Will there be a new version?

Not by me. If you'd like to take over development of Follow, please
shoot me an e-mail.

0.5 What language is Follow written in?

Python (see http://www.python.org).

When I started on the new version, I found that Perl was lacking in some
respects (particularly in its support for complex data structures and
persistence). While I could do what I wanted to in Perl, the result was
ugly and very difficult to maintain.

Python was like a breath of fresh air. It took very little effort to
learn coming from Perl, and there are numerous advantages to using it.
I'd encourage any Perl users out there to give it a try; you'll be
surprised.

## 1.0 Installation

1.1 Which distribution should I get?

As of Follow 2.04, the distribution is Python source.

## 2.0 Configuration

2.1 Why do there have to be two copies of the config file?

This was deemed the easiest way around a problem.
Each of the programs needs access to the file, but there isn't a
standard place to put it that would be accessible to both the regular
user and the Web user on every system. So, both programs look for the
config file in their current directory by default.

2.2 My site has several DNS names pointing at it; how do I configure
    for that?

As of Follow 2.03, it is possible to specify multiple siteurls; see
follow.conf for an example.

2.3 How do I set up a cron job?

On UNIX, run:

    crontab -e

which will bring up your crontab in your favourite editor (assuming
you've set $VISUAL). Next, add a *single* line (the \ is only there to
split it across two for display), like this:

    05 00 * * * /home/bob/bin/follow-gather 2>&1 1>/dev/null | \
    /bin/mail -s "Follow Output" bob

This will run the program /home/bob/bin/follow-gather at 5 minutes after
midnight every day, and mail any errors to the user 'bob' (remember, all
errors are on STDERR). The order of the redirections matters: 2>&1 must
come first so that STDERR goes into the pipe, and only then is STDOUT
sent to /dev/null.

For more help, see the cron(8) and crontab(5) man pages.

2.4 I can't use cron on the server. Can I still use Follow?

Yes. You must run follow-gather by hand every time you want to analyse a
new chunk of logfile; it is best to do this right before the logfile is
rotated.

2.5 I'd like to tell follow-gather where the configuration file is.

Use the '-c' command line option; for instance,

    follow-gather -c /httpd/cgi-bin/my_follow.conf

will make follow-gather use /httpd/cgi-bin/my_follow.conf as the config
file. Note that this works only for follow-gather, not follow.cgi.

2.6 I have several virtual servers on one machine; can they all use
    follow?

Yes. Give each server its own config file and cache directory, and run
follow-gather once per server with the '-c' option. See question 2.5.

Note that each server should have its own follow.cgi and conf file in
its cgi-bin directory.

If you run a Web Hosting service and this won't work on your setup, mail
me; there may be a way to customise the output script.
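The per-server runs described in 2.6 could be driven by a small wrapper,
sketched here in Python. This is only an illustration, not part of
Follow: the config paths and the function names are hypothetical, and
the only thing taken from Follow itself is the '-c' option from
question 2.5.

```python
import subprocess

# Hypothetical paths: one config file per virtual server.
CONFIGS = [
    "/httpd/site-a/cgi-bin/follow.conf",
    "/httpd/site-b/cgi-bin/follow.conf",
]


def build_commands(configs, gather="follow-gather"):
    """Return one follow-gather command (as an argv list) per config file."""
    return [[gather, "-c", conf] for conf in configs]


def run_all(configs, gather="follow-gather"):
    """Run follow-gather for each config in turn; return the configs that failed."""
    failed = []
    for argv in build_commands(configs, gather):
        try:
            # check=False: keep going so one bad site does not block the others.
            result = subprocess.run(argv, check=False)
        except OSError:
            # The follow-gather program itself could not be started.
            failed.append(argv[-1])
            continue
        if result.returncode != 0:
            failed.append(argv[-1])
    return failed
```

Calling run_all(CONFIGS) from a single cron entry would process every
server in sequence and return the list of config files whose run did not
exit cleanly.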
## 3.0 Combined Logfile Format

3.1 What is the Combined Logfile Format?

The Combined Format is a Web logfile format that appends the referer and
the user agent for each hit to each line. Most modern Web servers should
be able to produce this with little trouble; consult your documentation
for more details.

3.2 How do I set up Apache to use the Combined Format?

In Apache, the configurable logging module (mod_log_config) lets you
specify what format logs will be in. You can determine whether your
Apache has this module included by running the Apache binary with the
'-l' option:

    /usr/local/apache/httpd -l
    Compiled-in modules:
      http_core.c
      mod_env.c
      mod_log_config.c
      ...

In the example above, the server has mod_log_config. If yours does not,
encourage your Sysadmin to upgrade to the latest Apache; mod_log_config
is enabled by default in recent Apache distributions.

Once you're sure that you have configurable logging enabled, add this
line to your server configuration (either in srm.conf or in your virtual
host section):

    LogFormat "%h %l %u %t \"%r\" %s %b \"%{referer}i\" \"%{user-agent}i\""

Then, send a HUP to the server:

    kill -HUP `cat /var/log/httpd/httpd.pid`

3.3 How do I set up Netscape servers to use the Combined Format?

Use the Administration server to set logging options, and restart the
server.

[if you can give more detailed instructions, please mail me]

3.4 How do I set up Microsoft IIS to use the Combined Format?

To my knowledge, IIS does not support custom logging. A pity.

3.5 How do I set up other servers to use the Combined Format?

Consult your documentation. Generally, you need to add two fields to the
end of the standard (NCSA) format: the referer and the user-agent, both
with quotes around them.

## 4.0 Common Problems

4.1 follow.cgi complains about not finding the config file. What's
    wrong?

There must be a copy of the config file in the same directory as the cgi
script.
It must be called follow.conf, and it must be readable by the user that
runs the Web server (usually nobody, www, web or the like).

4.2 When I view a certain date, follow.cgi complains that it can't find
    a file.

This is most likely because the cache directory (and/or the files in it)
has permissions that don't allow reading by the Web user; see question
4.1.

4.3 There aren't any listings for a day, but my site got hits. What
    happened?

First, if your logfiles aren't in Combined format, follow-gather will
not be able to read them. Try running it with the -v (verbose) option;
if there are no processed lines, but many corrupt lines, this is most
likely the problem.

Second, Follow doesn't list every hit; it only shows hits which it can
verify as being part of a session from a client that isn't ignored.
Follow is about finding out how people use a site, not counting hits.

4.4 I've set up follow.conf as a symlink, and it doesn't work.

Don't use a symlink; make a copy of the file.

## 5.0 Interpreting and Tuning the Output

5.1 What do the different types of hits mean?

FOLLOW  - a hit that can be verified as coming from the previous page in
          the session. Generally, the user followed a link on the
          previous page.

BACK    - the user pressed the 'back' button on their browser one or
          several times to arrive at the destination page.

BROKEN  - follow cannot account for how the user arrived at this page;
          this indicates a conflicting referer and destination URL, and
          usually means that a Web proxy cache is being used.

OFFSITE - the page was reached from a non-local URL.

MANUAL  - the user typed in the URL manually, or used a bookmark to
          reach the page.

RELOAD  - the user hit the 'reload' button to refresh the page; either
          they thought that the content was stale, or they did not
          receive the complete page.

5.2 What do the byte (page and image size) totals mean?

The page byte total is the average number of bytes served when this page
is requested; in practice, this is the number of bytes in the page.
The image byte figure is a bit trickier; it lists the average number of
bytes associated with images that users request when they come to the
page. Because users may have some images cached from previous page
requests (ones that have used the same images), this number is not
necessarily the total of the sizes of the images on the page.

The image bytes figure is useful, though, because it gives an indication
of how much image data the average client has to fetch to use the page,
and therefore how long they have to wait for the download.

5.3 What are destination hits?

They are the hits that are the last in a user's session. Destination
hits are one measure of which pages have the information that users are
looking for (but they are also a measure of where users give up
looking).

5.4 Follow says that a page got x hits on a day, but I checked and it
    got more.

See question 4.3.

5.5 My cgi programs have strange statistics associated with them.

Because cgi programs can redirect the user to another page, or produce
their own page, you may see spurious 'BACK' hits. These should be
ignored.

## 6.0 How Follow Works

6.1 Why do two programs (follow-gather and follow.cgi) have to run?

follow-gather parses the logfiles and does a three-pass analysis of
them, storing the results in three databases in the cache directory.
follow.cgi is the viewer for the databases.

6.2 What does Follow do with client and server errors (4xx and 5xx
    status)?

Currently, Follow ignores them. This may change in the future.

6.3 What happens when Follow detects a 'BACK' hit?

It will 'travel' backwards in the session, looking to see if it can
identify where the hit came from. If it can, it will insert a 'BACK' hit
between the real hits to fill in the gap. If not, the hit will fall out
as a 'BROKEN' hit.

6.4 Why should I ignore Web spiders?

Web spiders introduce misleading sessions because, well, they're not
human.
Most spiders (or 'crawlers', etc.) will traverse the site and send back
a '-' referer, which follow interprets as a manual hit.

6.5 How does Follow deal with proxy and client caches?

Generally, it doesn't. Because it generates a unique name for each
client (not only from the IP address), there is only a very small chance
that two separate clients behind a cache will be interpreted as one.

However, proxy caches may cause 'BROKEN' links, because parts of
sessions will be served from the cache and never reach your server. To
cancel out this effect, configure follow to ignore sessions that have
broken links.

If your site specifies expiration times for objects (for instance, with
mod_expires on Apache, or with meta files on NCSA httpd), you may find
that user sessions are less than complete, because a higher proportion
of requests will be answered from caches. To compensate, you may want to
disable expiration for a short period of time to get a better idea of
how your site is being used.

6.6 How is the average page time calculated?

Page time is calculated for each page in the session except the last
page; there is no way to determine the time spent on the last page from
the logfiles.

Note that page times are to be taken with a grain of salt; users can do
anything (including sleeping) between page requests.
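To illustrate why the last page has no page time, here is a minimal
sketch in Python. It is not Follow's actual code: it assumes that a
page's time is simply the gap until the next request in the same
session, and the function names and sample session are hypothetical.

```python
from datetime import datetime


def page_times(session):
    """Given [(url, request_time), ...] in request order, return
    [(url, seconds_on_page), ...] for every page except the last."""
    times = []
    # Pair each request with the one that follows it; the last request
    # has no successor, so it never gets a time.
    for (url, start), (_next_url, next_start) in zip(session, session[1:]):
        times.append((url, (next_start - start).total_seconds()))
    return times


def average_page_time(sessions, url):
    """Average the timed visits to one URL across sessions (None if untimed)."""
    samples = [secs for s in sessions for (u, secs) in page_times(s) if u == url]
    return sum(samples) / len(samples) if samples else None


# Hypothetical session: three requests from one client.
example = [
    ("/index.html", datetime(1998, 5, 1, 12, 0, 0)),
    ("/docs.html", datetime(1998, 5, 1, 12, 0, 30)),
    ("/faq.html", datetime(1998, 5, 1, 12, 1, 30)),
]
print(page_times(example))  # [('/index.html', 30.0), ('/docs.html', 60.0)]
```

In this sketch /faq.html, the destination hit, contributes nothing to
the averages, which matches the exception described above.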