Your server can generate two or more log files with information about the transactions it processes. The most important information from a management standpoint is contained in the access log and the error log. If you want to know what’s happening on your site, look at your access log, which contains information about every completed HTTP transaction.
An access log file shows the visitor’s IP address, the date and time a page is requested, the name and location of the file requested, a status code, and the number of bytes transferred. Unfortunately, unless your site includes only a handful of simple pages and gets few visitors, browsing through your access logs by hand rapidly becomes impractical. A site with even a moderate volume of visitors and a few dozen pages will likely generate access logs with thousands of entries per week.
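To make those fields concrete, here is a minimal sketch in Python of pulling them out of a single entry. The sample line follows the common log format most servers write; the field names are our own labels, not anything the server dictates.

    import re

    # One entry in the common log format: host, identity, user, timestamp,
    # request line, status code, and bytes transferred.
    ENTRY = ('192.168.1.10 - - [12/Mar/1998:14:32:06 -0500] '
             '"GET /index.html HTTP/1.0" 200 5120')

    CLF = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                     r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)')

    match = CLF.match(ENTRY)
    if match:
        fields = match.groupdict()
        print(fields['host'], fields['time'], fields['status'], fields['bytes'])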
You can harness the power of your access logs by using software to analyze the data they contain. Most commercial Web servers, such as Microsoft’s Site Server and Netscape Communications’ Enterprise Server, include log file analysis tools.
There are also dozens of commercial log file analysis tools available at a relatively low price, such as WebTrends’ (Portland, OR) WebTrends ($299) and WebManage Technologies’ (Nashua, NH) NetIntellect ($199). Numerous freeware options provide equally detailed reporting, albeit without the extensive documentation and support of the commercial products. One freeware example is wwwstat by Roy Fielding of the University of California, Irvine. A Perl script written for Unix-based systems, it generates detailed, table-based traffic reports in HTML format and integrates with gwstat, another freeware program, to graphically display site traffic.
Log file analyzers operate on the same principle: They parse each line in the access log, populate a database with the parsed data, and build reports based on a variety of queries. The most basic packages provide information such as total number of hits, impressions (pages viewed), least frequently and most frequently visited pages, number of kilobytes transferred, and number of client or server errors.
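Continuing the Python sketch above, the parse-then-tally loop might look like the following. This is a hedged illustration, not how any particular product works: the “database” here is a plain dictionary, where a real analyzer would use something far sturdier.

    # Build the basic report figures from a stream of parsed entries. Each
    # entry is a dictionary like the one produced by the regular expression
    # in the previous sketch.
    def summarize(entries):
        hits = total_bytes = errors = 0
        page_counts = {}
        for fields in entries:
            hits += 1
            if fields['bytes'] != '-':
                total_bytes += int(fields['bytes'])
            if fields['status'][0] in '45':   # client or server error
                errors += 1
            parts = fields['request'].split()
            page = parts[1] if len(parts) > 1 else '(bad request)'
            page_counts[page] = page_counts.get(page, 0) + 1
        busiest = max(page_counts, key=page_counts.get) if page_counts else None
        return {'hits': hits, 'kbytes': total_bytes // 1024,
                'errors': errors, 'busiest page': busiest}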
Most analyzers further report these measurements in daily, hourly, or even shorter time increments. Such fine-grained detail provides valuable information about overall traffic patterns, peak traffic periods, data transfer rates, and problem files on your site. Table 1 (page 38), generated by WebTrends, shows average daily traffic, peak traffic periods, and low traffic periods.
Because of the stateless nature of Web transactions, log file analyzers necessarily make some assumptions about the data pulled from the access logs. For example, there might be a fixed (or possibly configurable) time span that is used to determine a session time-out. The log file analyzer identifies sequential page requests from one IP address as a single user session, and assumes a session time-out when it encounters a time gap greater than the fixed time span. In this way, the software can deliver relatively accurate information about the total number of user sessions (visits to the site), average session lengths, common paths through the site, and more.
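In code, the session heuristic is only a few lines. A minimal sketch, assuming requests arrive as (IP address, Unix timestamp) pairs in log order and taking 30 minutes as the timeout:

    # Reconstruct user sessions from (ip_address, unix_timestamp) pairs,
    # taken in the order they appear in the log. The 30-minute timeout is
    # our assumption; real analyzers usually make it configurable.
    SESSION_TIMEOUT = 30 * 60   # seconds

    def count_sessions(requests):
        last_seen = {}    # ip -> timestamp of that ip's previous request
        sessions = 0
        for ip, when in requests:
            previous = last_seen.get(ip)
            if previous is None or when - previous > SESSION_TIMEOUT:
                sessions += 1    # a long enough gap starts a new session
            last_seen[ip] = when
        return sessions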
Beyond the basic reporting options described previously, log file analyzers offer a variety of advanced features, many of which are not only desirable but possibly crucial to keeping your environment running smoothly. If you’re planning to add a log file analyzer to your site, or upgrade your existing software, take your site’s configuration into account, as well as any advanced reporting you and your users will require.
Ask yourself the following questions about your log file analyzer:
Can it read and synthesize extended log file formats? Most commercial servers let you capture extended information in your access logs, such as browser type and version. Make sure you know what options your server offers for capturing access statistics, and verify that the analysis software supports those formats as well.
Can it merge logs from multiple servers? If you are load balancing multiple servers, or need to merge traffic information from multiple sites, this option lets you measure traffic across all servers on the site (the core of the merge is sketched after this list).
What methods does it support for retrieving access log files? For example, does it support direct file access, HTTP, FTP, or some combination of the three? Also, can it support username and password authentication for proxy or remote file access?
Does it include a scheduler for setting up reports in advance? If you need to generate periodic reports for one or more sites, report scheduling helps reduce your workload.
Does it support real-time reporting of log file data? If it’s important for you to have up-to-the-minute information about your site’s traffic, this feature is a must.
How customizable are the reports it generates? You won’t find a one-size-fits-all solution in any of the log file analysis packages on the market. The more flexibility you have to configure reports and filter report data by factors such as time, file type, and directory, the better. If you need heavy-duty reporting capabilities, look for a tool that can export the compiled data to an external database, allowing you to create customized reports and queries.
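On the multiple-server question above, the heart of the job is an ordered merge by timestamp. A minimal sketch, assuming each server’s log has already been parsed into (timestamp, line) pairs sorted within each file, as access logs normally are:

    import heapq

    # Interleave per-server logs into one stream ordered by timestamp.
    def merge_logs(*server_logs):
        return list(heapq.merge(*server_logs))

    combined = merge_logs(
        [(1, 'www1 GET /index.html'), (5, 'www1 GET /news.html')],
        [(2, 'www2 GET /index.html'), (4, 'www2 GET /toc.html')],
    )
    # combined now runs 1, 2, 4, 5 across both servers.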
[Figure: A typical server log file.]
Other possible options to look for include automatic conversion of IP addresses to domain names, report delivery via e-mail, and the capability to output reports in non-HTML text formats, such as spreadsheet or word-processor formats.
In addition to the access log, your server also generates an error log file, which is a record of failed HTTP transactions such as unauthorized access or missing file errors.
It’s good practice to browse your error log on a periodic basis to find problems and security breaches on your site. Consecutive failed attempts to access secure areas of your site may indicate someone trying to exploit a security hole on the site. You can also quickly identify broken links, missing pages, and misplaced image files.
While few tools are available to analyze error logs, the error logs are generally much smaller than the access logs, and thus are easier to browse with a text editor.
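If you’d rather not eyeball the whole file, a few lines of script can flag the repeated failed attempts described above. This is a rough sketch; error log formats differ from server to server, so both the pattern and the file name below are assumptions you would adapt to your own log.

    import re

    # Count repeated authorization failures per host. The expression and
    # the 'error_log' file name are placeholders; match them to your server.
    AUTH_FAIL = re.compile(r'client (\S+).*authorization failed', re.I)

    counts = {}
    with open('error_log') as log:
        for line in log:
            match = AUTH_FAIL.search(line)
            if match:
                host = match.group(1)
                counts[host] = counts.get(host, 0) + 1

    for host, tries in counts.items():
        if tries >= 5:   # arbitrary threshold for "worth a look"
            print(host, 'failed authorization', tries, 'times')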
Anyone involved with creating or maintaining content on a Web site understands how difficult it is to control the integrity of the site’s resources. Problems such as broken links, missing files, or poorly coded HTML crop up all too often in environments featuring multiple authors and internal and external resources that are in a constant state of flux.
Even with tight controls on authoring and administration practices, it’s inevitable that there will be structural problems scattered throughout a site. If this scenario describes your Web environment, it’s likely your production and development personnel devote a significant amount of time to locating and correcting such problems.
Help is available in the form of Web content management tools, which root out and correct problem files. Like log file analysis tools, most products are competitively priced; examples include Tetranet Software’s (Kanata, Ontario) Linkbot ($249) and Site Technologies’ (Scotts Valley, CA) SiteSweeper Workstation ($295).
These tools, also known as link checkers or site mappers, provide quality control by sifting through a site in search of broken links, missing images, poorly constructed HTML code, and other problems. Most can display a visual roadmap of a site, making it easier to understand the site’s structure and find files or sections quickly. Figure 1 shows an example of a site map as implemented by Mercury Interactive’s (Sunnyvale, CA) Astra SiteManager.
After you direct the link checker to a site’s home URL, it retrieves the index page and searches for hypertext links, image tags, and other media links embedded within the code. It tests all the resources, verifying that they exist and load properly. The link checker follows the same procedure for each page linked to the index page, collecting information about the page’s resources and following any hypertext links it encounters. In this manner, it continues to sift through a site, building a catalog of information about all the files it processes.
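Stripped to its essentials, that procedure is a simple crawl. A minimal sketch follows; it tests only anchor links, where real products also check image and media references, and it stays within the starting site.

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    # Collect href targets from anchor tags on one page.
    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    # Walk the site from its home URL, recording every page reached and
    # every link that fails to load.
    def crawl(start_url):
        to_visit, seen, broken = [start_url], set(), []
        while to_visit:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                page = urllib.request.urlopen(url).read().decode('latin-1')
            except Exception:
                broken.append(url)
                continue
            parser = LinkExtractor()
            parser.feed(page)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith(start_url):   # stay on this site
                    to_visit.append(absolute)
        return seen, broken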
Content management tools may offer a wide variety of functions, including testing for broken links, building a visual map of the site, locating orphans (unused files no longer linked to any pages in the Web site), identifying slow-loading pages, and possibly locating and repairing incorrectly coded HTML.
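Orphan hunting inverts the crawl: once you know which pages the crawler reached, anything on disk it never touched is a candidate orphan. A rough sketch, in which the document root path and the URL-to-path mapping are simplifications of what a real tool would do:

    import os
    from urllib.parse import urlparse

    # Compare files under the document root against the URLs the crawl
    # visited; anything never reached is reported as a possible orphan.
    def find_orphans(document_root, visited_urls):
        reached = {urlparse(u).path.lstrip('/') for u in visited_urls}
        orphans = []
        for directory, _, files in os.walk(document_root):
            for name in files:
                path = os.path.relpath(os.path.join(directory, name),
                                       document_root)
                if path.replace(os.sep, '/') not in reached:
                    orphans.append(path)
        return orphans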
The available tools fall into two basic categories: products offering better mapping and navigational abilities, and products better suited to reporting and correcting errors.
As with log file analysis tools, you should know what features are needed for your environment and select the tool accordingly. If you’re thinking about adding content management software to your site, consider the following questions:
How does it help find and repair broken links? All these tools should help root out and repair broken links, but how they go about it differs. If broken links are the biggest problem on your site, look for tools that offer advanced error-reporting capabilities rather than intuitive visual mapping. Also, if you need to correct numerous errors on a regular basis, check to see that the tool can integrate with your third-party editing software, or that it includes built-in editing software.
Can it locate outdated files? If your site’s resources change or rotate frequently, look for a tool that can test and report the last time files were saved.
Can it test for slow-loading pages? If you’re concerned about alienating bandwidth-challenged visitors, this helps identify excessively large files that might result in aborted transactions. Some tools can help locate problem files based on your minimum bandwidth standard (say, a 28.8Kbit/sec or faster modem); the underlying arithmetic is sketched after this list.
What errors can it detect in HTML code? This feature could help isolate and correct inconsistencies resulting from poor coding techniques. If this is a common problem on your site, you should find out to what extent the tool analyzes HTML files at the code level. Some common problems are duplicated title tags, incomplete or missing head information, and images with missing alt, height, or width attributes.
Does it support selective scanning? Filtering by directory, file type, or number of link levels allows you to focus on a subset of your site, which might help reduce the time the tool needs to read and analyze the site.
How does it display a site’s structure? If your greatest need is for a tool that helps you visualize your site’s overall structure and organization, look for one with more extensive and intuitive mapping features.
How will the product scale as your Web site grows? If your site is already large or growing rapidly, you need a product that won’t get bogged down by thousands of documents. While vendors might claim that their products are scalable, you should test several such products on your site before making a purchase.
Can it support form-generated CGI content? Some tools let you preset variables to be entered on your site’s CGI forms, enabling you to test and map dynamically generated pages.
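As promised above, the slow-page arithmetic is simple division: total the bytes in a page and its embedded resources, then divide by the visitor’s effective bandwidth. A quick sketch using the 28.8Kbit/sec floor mentioned earlier:

    # Estimate download time for a page over a slow modem link.
    # 28.8Kbit/sec is roughly 3,600 bytes/sec, ignoring protocol
    # overhead, which makes real transfers slower still.
    MODEM_BYTES_PER_SEC = 28800 / 8

    def load_time(page_bytes, image_bytes):
        return (page_bytes + image_bytes) / MODEM_BYTES_PER_SEC

    # A 10KB page with 90KB of images takes about 28 seconds:
    print(round(load_time(10 * 1024, 90 * 1024)))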
Overall, content management software products are still relatively immature but are evolving quickly; look for rapid upgrades and expanding feature sets in the coming year.
Although none of the currently available tools provides excellent support for all the requirements and functions we’ve discussed in this article, Webmasters trying to maintain large or frequently changing sites can benefit from the visualization and error-detection capabilities these tools provide. Most vendors offer downloadable evaluation versions of their products on the Web, so you’d be wise to try out several to find the one that best matches your site’s content management needs.
The number of Web management tools on the market has mushroomed within a relatively short time span. Competition is rampant, with vendors scrambling to expand their product lines to cover a wider range of management functions. This is good news for the consumer, who can expect more innovations and tools that integrate a growing number of solutions.
Although this trend should continue for at least another year or two, that doesn’t mean you should wait to get help managing your Web site. Most of the products available provide a wealth of information and utility at very affordable prices.
As long as you understand what these tools really offer, their limitations, and how well they integrate into your environment, you can reap substantial benefits from them today.
Phil Keppeler, Network Magazine’s Webmaster, can be reached at pkeppele@mfi.com.
There’s a diverse range of products that their makers claim are Web management tools. If you’re planning to expand your site’s Web management capabilities, you need to understand what each type of tool manages, and whether it applies to the needs of your environment.
The phrase Web site management tools describes anything from basic authoring tools to large, enterprise-level Web development and deployment systems. A breakdown follows of how the market is currently categorized.
Integrated Web authoring tools. Examples are Haht Software’s (Raleigh, NC) Haht Site and NetObjects’ (Redwood City, CA) Fusion. In addition to providing a wide range of design and development functions, they help Webmasters manage site structure through site mapping and visualization features.
Traffic analysis tools. These include products such as WebTrends’ (Portland, OR) WebTrends and net.Genesis’ (Cambridge, MA) net.Analysis Pro, which slice and dice a server’s log files for comprehensive site traffic and usage reporting.
Link checkers and site mappers. This category includes Mercury Interactive’s (Sunnyvale, CA) Astra SiteManager and Site Technologies’ (Scotts Valley, CA) SiteSweeper. These tools sift through a server’s content to create a map of a site’s structure and report on structural problems, such as broken links, within files.
Performance monitors. Avesta’s (Nepean, Ontario) Webwatcher and Network Associates’ (Santa Clara, CA) WebSniffer fall into this category. These utilities monitor the availability and performance of network resources. In addition to providing feedback about your network’s performance, they automatically alert you when network services fall outside of an acceptable range.
Bandwidth managers. Examples include RND Network’s (Mahwah, NJ) hardware-based Web Server Director and Resonate’s (Mountain View, CA) software-based Dispatch. These products help those responsible for the network infrastructure manage bandwidth and load across network resources. There are many hardware, software, and hardware/software combination solutions that distribute Web traffic among multiple servers in a local or distributed environment. Some help manage servers in such environments, and some help manage the content resources throughout the Web environment.
Workgroup management and version control tools. Two offerings in this category are MKS’ (Waterloo, Ontario) Web Integrity and Wallop Software’s (Foster City, CA) BuildIT. They provide file check-in and check-out, as well as version control for complex sites with a workgroup development environment.
Comprehensive enterprise development systems. Example offerings are Vignette’s (Austin, TX) StoryServer and Inso’s (Boston) DynaBase. Products in this category provide a complete framework for producing complex, data-driven sites that may include connections to back-end data sources, workgroup development environments, e-commerce, and more.