
Web Statistics

Posted Mar 30, 2009

Last Updated Nov 12, 2011

Almost any time I discuss websites with clients and others interested in websites, I hear mention of, or questions about, how many hits a website gets. Some people brag about how many hits their website gets; others ask how many hits my sites get. I understand what they are talking about... but in the back of my head I am smiling, because I know that they are not really talking about hits at all.

Unless your website is purely text and has no embedded objects such as images, Flash files, videos, JavaScript files, style sheets, etc. (all the elements that standard websites use to liven up pages), you are not going to get a very accurate picture of your traffic by paying attention to hits. In fact, the more elements on a page (the richer your pages are with images, etc.), the less useful hits are in determining how much traffic you are getting. The reason is that your server counts every single call to every referenced file as a hit. Thus, a single visit to a single web page with ten pictures results in your server recording at least eleven hits: one for the web page call, and one for each of the ten images. If there are externally referenced style sheets or JavaScript files, that number will go up.
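To make the arithmetic concrete, here is a minimal Python sketch. The request paths and the set of extensions treated as "pages" are made up for illustration; real statistics packages apply their own rules for what counts as a page.

```python
from pathlib import PurePosixPath

# Hypothetical requests generated by ONE visit to ONE page with ten pictures.
requests = ["/index.html"] + [f"/images/photo{i}.jpg" for i in range(1, 11)]

# Assumption: only these extensions (or no extension at all) count as a page.
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php"}


def count_hits_and_pages(paths):
    """Return (hits, page_views) for a list of requested paths."""
    hits = len(paths)  # every requested file is a hit
    page_views = sum(
        1 for p in paths if PurePosixPath(p).suffix.lower() in PAGE_EXTENSIONS
    )
    return hits, page_views


hits, page_views = count_hits_and_pages(requests)
print(f"hits={hits} page_views={page_views}")
```

One visitor looking at one page shows up here as eleven hits but only one page view.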

Generally speaking, when someone is concerned about traffic, they are really concerned about visits. You want to know how many people came to your site. When you look at your statistics, you want to pay more attention to visits and (in most cases) very little attention to hits¹. The first time someone comes to your site, their visit increases your server's visit count by one but should not increase the count any more as they browse your site.
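As a rough illustration of how a visit count differs from a hit count, the sketch below groups hits by IP address and starts a new visit only when that address has been quiet for a while. The 30-minute timeout and the sample data are assumptions chosen for the example; each statistics package defines a visit in its own way.

```python
from datetime import datetime, timedelta

# Assumption: a gap longer than this starts a new visit.
SESSION_TIMEOUT = timedelta(minutes=30)


def count_visits(hits):
    """hits: iterable of (ip, datetime) pairs. Return the number of visits."""
    last_seen = {}
    visits = 0
    for ip, when in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous > SESSION_TIMEOUT:
            visits += 1  # first hit from this IP, or first hit after a long gap
        last_seen[ip] = when
    return visits


t0 = datetime(2011, 11, 12, 9, 0)
sample = [
    ("203.0.113.5", t0),
    ("203.0.113.5", t0 + timedelta(minutes=2)),
    ("203.0.113.5", t0 + timedelta(minutes=5)),
    ("198.51.100.7", t0 + timedelta(minutes=10)),
    ("203.0.113.5", t0 + timedelta(hours=3)),  # same address, much later
]
print(len(sample), "hits,", count_visits(sample), "visits")
```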

At the bottom of this article you can find links with more detailed information about the elements of web statistics. The rest of this article is a discussion on some of the problems and challenges facing the collection of traffic data as well as interpreting that data.

Data Collection

You would think that the collection of statistical data is straightforward. It's not. In fact, there is very little hard data that comes from your statistical logs; furthermore, even highly advanced systems, such as Google Analytics, give only a partial view of your traffic.

The reason that statistical data is hard to generate is that there are only a few pieces of hard data collected when someone makes a call to your website. Those things are listed below:

A computer called a file on your site
The file was called at a specific time
The call was made from a specific IP address²
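The three items above correspond to fields in a typical access log entry. The sketch below pulls them out of a single made-up line; it assumes the widely used Common Log Format, and your server may be configured to log differently.

```python
import re
from datetime import datetime

# Hypothetical log line, assuming Common Log Format.
LINE = (
    '203.0.113.5 - - [12/Nov/2011:09:15:02 -0500] '
    '"GET /index.html HTTP/1.1" 200 5120'
)

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*"'
)


def hard_data(line):
    """Return the file, time and IP address of one request, or None if unparseable."""
    match = PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {"file": match.group("path"), "time": when, "ip": match.group("ip")}


print(hard_data(LINE))
```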

That is really all you can find out about any call to your site. And yet, most web statistics programs have a long list of data that can be analyzed such as what page sent a visitor to your site; what country the visitor is from; what browser the user is using; what operating system, etc. How is this possible?

The reason is that browsers communicating with a web server send a bunch of data that includes all this information. The problem is that none of it is validated, and all of the data supplied by the browser client can be spoofed. You can easily tell your browser to tell web servers that you are using a totally different browser than the one you are actually using. The Safari web browser, for example, has built-in menus for reporting a wide range of user agents (browsers); most other browsers have plugins that let you modify the user agent. Bots, spiders and farm applications (programs that automatically index and browse your site) often report that they are normal user agents instead of robots. You could, if you wanted to, report to the web server that you are browsing with the Hamster Fur Lightning 3.0 Web Browser. The point is that web servers have to take it on faith that the visitor is being honest.
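To show how little effort spoofing takes, this short sketch uses Python's standard urllib.request module to build a request that claims to come from the made-up browser mentioned above. The URL is a placeholder, and the request is only constructed here, never sent.

```python
from urllib.request import Request


def build_request(url, claimed_browser):
    """Build (but do not send) a request that claims to come from claimed_browser."""
    return Request(url, headers={"User-Agent": claimed_browser})


req = build_request("http://example.com/", "Hamster Fur Lightning 3.0 Web Browser")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A server receiving this request would log the user agent string exactly as supplied; it has no way to check it.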

Other information such as referral sources (links that sent a visitor to a page on your site) can also be spoofed or (in many cases) stripped. Many security applications strip this information before it reaches the server, for privacy reasons among others. As a result, many visitors who arrive from a link on a partner's website will not be recorded as coming from there, which makes it difficult to know where all of your traffic is coming from.
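In practice this means any tally of referral sources needs an "unknown" bucket. The sketch below is one hedged way to classify a logged referrer value; the host name www.example.com stands in for your own site, and the "-" placeholder for a missing value is an assumption based on common log formats.

```python
from collections import Counter
from urllib.parse import urlsplit


def classify_referrer(referrer, own_host="www.example.com"):
    """Classify a logged referrer as 'internal', 'external' or 'unknown'."""
    if not referrer or referrer == "-":
        return "unknown"  # stripped, never sent, or typed directly
    host = urlsplit(referrer).hostname
    if host is None:
        return "unknown"  # not a usable URL
    return "internal" if host == own_host else "external"


logged = [
    "http://partner.example.org/links.html",
    "-",
    "http://www.example.com/about.html",
    "-",
]
print(Counter(classify_referrer(r) for r in logged))
```

The "unknown" count is a reminder that the external total is a lower bound, not the real number of referred visitors.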

Web Data Stat Services

Following is a brief discussion and comparison of on-server statistical applications and off-server programs that collect traffic data and make reports on the traffic.

On-Server Applications

There are many ways to collect web statistics data. The most common way is directly on the server via pre-installed statistical applications. Two common applications are Webalizer and AWStats. These programs read server logs, compile data and generate reports based on the information. Of course, as noted above, much of their data must be taken with a grain of salt, because they rely heavily on the data supplied by visitors. These programs are useful because they have direct access to your access logs across your entire site and for specific files, in a way that many off-server programs do not (see below).

One of the problems with an on-server application is that it cannot see the wider trends that widely deployed systems can. For example, if a new spider comes out that starts indexing your site from a range of IP addresses, the on-server program may not immediately recognize the new spider as a spider; as a result, it reports it as an actual human visitor. You may notice that your web server statistics show sudden spikes in traffic without a commensurate spike in user feedback. That is likely a spider or bot that your statistical software did not recognize.
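A rough way to spot such visitors yourself is sketched below. It flags addresses whose user agent admits to being a robot, and also addresses that request an implausible number of distinct pages. The keyword list and the threshold of 50 pages are arbitrary assumptions for the example, not values taken from any statistics package.

```python
from collections import defaultdict

BOT_HINTS = ("bot", "spider", "crawler")  # assumption: substrings honest robots use
MAX_PAGES_PER_IP = 50  # assumption: arbitrary cutoff for "too many pages"


def likely_bots(hits):
    """hits: iterable of (ip, path, user_agent). Return the set of suspicious IPs."""
    pages = defaultdict(set)
    suspects = set()
    for ip, path, user_agent in hits:
        pages[ip].add(path)
        if any(hint in user_agent.lower() for hint in BOT_HINTS):
            suspects.add(ip)  # the client says it is a robot
    # Clients that claim to be normal browsers but fetch a huge number of pages.
    suspects.update(ip for ip, seen in pages.items() if len(seen) > MAX_PAGES_PER_IP)
    return suspects


sample_hits = [
    ("198.51.100.7", "/index.html", "Mozilla/5.0"),
    ("198.51.100.7", "/about.html", "Mozilla/5.0"),
    ("192.0.2.10", "/index.html", "ExampleBot/1.0"),
] + [("203.0.113.99", f"/page{i}.html", "Mozilla/5.0") for i in range(200)]

print(sorted(likely_bots(sample_hits)))
```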

Off-Server Applications

Off-server statistical programs are systems that run somewhere other than your web server. One popular program is Google Analytics. The benefit of these types of programs is that they are often run by companies with a large set of resources (in equipment, manpower, money and data), which allows them to give very informative views of your traffic. One thing that good off-server statistical programs offer is the ability to quickly identify non-human traffic and filter it. This means, for example, that a new spider indexing your site will not erroneously show up as a human that viewed every page on your site.

Another benefit of off-server programs (especially widespread ones such as Google Analytics) is that they increase the likelihood of getting accurate referral data—if the referral link was from a site that uses the same off-server service that your site is using, that software can still track referrals even when security software is stripping referral data.

As slick as these off-server statistical programs are, they do have one major caveat. They only work if the browser and/or security software allows them to run. Any visitor who turns off JavaScript, for example, will not show up in the Google Analytics data. While the trend seems to be that fewer and fewer users turn off JavaScript as the Internet matures, upwards of five percent of web users were disabling JavaScript globally as of January 2008. That number could be higher for your site depending on the type of visitors you have. For example, if your readers are a more paranoid crowd, you may find that a larger number of them have JavaScript turned off, in which case they would never show up in your Analytics data.

Another problem with programs such as Google Analytics is that they only detect calls to, and calls made from, web pages that have the analytics JavaScript code embedded in them. This means that anyone who types into their address bar the URL of a video file, image, JavaScript file or any other file, instead of a web page with the embedded analytics code, will not make an appearance in your statistics results. For example, if someone is using a picture from your site inside one of their own web pages, your picture is being viewed but the analytics software will not know of it (whereas on-server statistics will show this hit/referral). Another way to understand this is that the analytics software only collects data for web pages into which you have put the required code, so if you miss a web page, that page is left out of the collected data.
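The picture example is something on-server logs can reveal, provided the referrer was not stripped. The sketch below looks for image requests whose referrer is a page on someone else's site; the host names, file extensions and log entries are all invented for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}  # assumption: what counts as an image
OWN_HOST = "www.example.com"  # placeholder for your own site


def hotlinked_images(hits, own_host=OWN_HOST):
    """hits: iterable of (path, referrer). Map each image to the outside hosts showing it."""
    found = defaultdict(set)
    for path, referrer in hits:
        if PurePosixPath(path).suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # only interested in image files
        host = urlsplit(referrer).hostname if referrer and referrer != "-" else None
        if host is not None and host != own_host:
            found[path].add(host)
    return dict(found)


log = [
    ("/images/logo.png", "http://www.example.com/index.html"),
    ("/images/logo.png", "http://blog.example.net/post.html"),
    ("/images/logo.png", "-"),
    ("/index.html", "http://blog.example.net/post.html"),
]
print(hotlinked_images(log))
```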

Hit Counters

As mentioned earlier, a hit can be misleading. Still, hits can be useful data... and are often used by tools called hit counters. These are either on-server or off-server simple programs that increment every time a page or file is hit. While a hit can be useful, the use of off-server hit counters is not recommended unless you really trust the service provider.

Hit counters are notoriously unreliable. First of all, most hit counters will increment every time a page is hit—meaning someone could simply go to a page and hit refresh ten times and the counter goes up; bots could index the page a few times in a row. That hit number then becomes meaningless.
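The refresh problem is easy to reproduce. The sketch below is a deliberately naive counter of the kind described, with a tally of distinct IP addresses alongside it for comparison; it is an illustration, not the code of any particular hit counter product.

```python
class NaiveHitCounter:
    """Increments on every request, no matter who makes it or how often."""

    def __init__(self):
        self.hits = 0
        self.addresses = set()

    def record(self, ip):
        self.hits += 1
        self.addresses.add(ip)  # kept only so the two numbers can be compared


counter = NaiveHitCounter()

# One person loads the page and presses refresh nine more times.
for _ in range(10):
    counter.record("203.0.113.5")

# A second person loads the page once.
counter.record("198.51.100.7")

print(f"counter shows {counter.hits}, distinct addresses: {len(counter.addresses)}")
```

Two people produced a counter reading of eleven.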

Even if the service provider gives more detailed statistical information (such as referral sources, etc.), it is not recommended to use hit counters. Reasons include performance and security. If a hit counter is provided by a company whose server begins to bog down (due to too many of their users' websites getting hit all at once), pages on your site could start loading more slowly. Worse, if the hit counter is provided by an untrustworthy source, then you are asking for trouble. One of my clients used a hit counter that was embedded into sites using iframes. The web page referenced in the iframe, which displayed the hit counter image, was infected with a virus, so all visitors to this client's site were getting security warnings when they viewed the site. Worse may have happened to visitors who were not protected with up-to-date browsers or security software.

I recommend that you do not use hit counters on business sites; if you must have a hit counter, you should use an on-server program. As they are very simple to program, you should be able to hire a programmer to build a hit counter that fits your needs at minimal cost. Still, my recommendation is to simply pass up hit counters and learn to use your on-server statistical software as well as off-server services to know how many visitors and hits you are getting.

Conclusion

Understanding the traffic on your website is vital. It is important to know who your visitors are, what they are coming for, where they are coming from, and what avenues they take to reach your goals. However, to be educated, you must understand that there is no complete picture of who your visitors are and what they are doing. To get the best possible view of your visitors, you need to combine the data from both on-server and off-server statistical applications. Even then you won't have a complete picture. To fill in the gaps you will need to interact directly with your users.

Reference

Browser Statistics

Statistical Programs

Notes

¹ This is a general recommendation for someone trying to get a general view of traffic across a site. However, hits are useful if viewed in relation to specific files. If your statistics software shows you how many times a specific page or file was hit, then you can get a feeling for which specific files get called the most. For example, if you see that a specific video file gets hit far more often than other files, you can make certain assumptions, such as that it is popular with your visitors (they keep refreshing the video and watching it over and over) or that it is being shown on many pages across your site.
² Even an IP address can be suspect. I could be sitting at a computer at one IP address, remotely open a browser on another computer at a different IP address, and then visit your website. In this case, I could be sitting in my office in the USA but logged into a computer in Japan, and your server would show that someone from Japan visited your site.