Archives, Logs and Google Analytics

It is vital to have a sense of the value of your service, and if you run a website, particularly a discovery website, you want to be sure that people are using it effectively. This is crucial for an online service like the Archives Hub, but it is important for all of us, as we invest time and effort in putting things online and we are aware of the potential the Web gives us for opening up our collections.

But measuring use of a website is no simple thing. You may hear people blithely talking about the number of ‘hits’ their website gets, but what does this really mean?

I wanted to share a few things that we’ve been doing to try to make better sense of our stats, and to understand more about the pitfalls of website use figures. There is still plenty we can do, and more familiarity with the tools at our disposal may yield other options to help us, but we do now have a better understanding of the dangers of taking stats at face value.

Logs

We are all likely to have usage logs of some kind if we have a website, even if it is just the basic Apache web logs. These are part of what the Apache web server offers. The format of these can be configured to suit, although I suspect many of us don’t look into this in too much detail. You may also have other logs – your system may generate these. Our current system provides a few different log files, from which we can find out a bit more about use.

Apache access logs typically contain: the IP address of the requesting machine, the date and time of the access, the HTTP method (usually ‘GET’ or ‘POST’), the requested resource (the URL of the page, image, PDF etc.), the HTTP status code, the size of what is returned, the referring site, if available, and the user agent. The last of these will sometimes provide information on the browser used to make the request, although this will often not be the case.

Example apache access log entry:

54.72.xx.xx - - [14/Sep/2015:11:51:19 +0000] "GET /data/gb015-williamwagstaffe HTTP/1.1" 200 22244 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

So, with this you can find out something about the source of the request (the IP address) and what is being requested (the URL of the resource).
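As a rough illustration, these fields can be pulled apart with a short script. The following is a minimal sketch in Python, assuming the standard ‘combined’ log format shown above; the field names are our own choice, and the log line is the example given earlier.

import re

# Matches the Apache 'combined' log format: IP address, identity, user,
# timestamp, request line, status code, size, referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the fields in one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_line(
    '54.72.xx.xx - - [14/Sep/2015:11:51:19 +0000] '
    '"GET /data/gb015-williamwagstaffe HTTP/1.1" 200 22244 "-" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"'
)
print(entry['ip'], entry['status'], entry['request'])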

Other logs such as our current system’s search logs may provide further information, often including more about the nature of the query and maybe the number of hits and the response time.

Google Analytics

Increasingly, we are turning to Google Analytics (GA) as a convenient method of collecting stats, and providing nice charts to show use of the service. Google Analytics requires you to add some specific code to the pages that you want tracked. GA provides for lots of customisation, but out of the box it does a pretty good job of providing information on pages accessed, number of accesses, routes, bounce rate, user agents (browsers), and so on.

Processing your stats

If you do choose to use your own logs and process your stats, then you have some decisions to make about how you are going to do this. One of the first things that I learnt when doing this is that ‘hits’ is a very misleading term. If you hear someone promoting their site on the basis of the number of hits, then beware. Hits actually refers to the number of files downloaded from your site. One page may include several photos, buttons and other graphics, and these all count as hits, so one page accessed may represent many hits. Therefore hits is largely meaningless as a measure of use. Page views is a more helpful measure, as one page accessed counts as ‘one’, however many files it pulls in.
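The difference is easy to see if you tally the same log twice: once counting every file requested, and once counting only the requests that are not for images, stylesheets or scripts. A rough sketch, assuming a combined-format log called access.log and a crude rule of thumb about which file extensions are page ‘furniture’:

import re

# File extensions we treat as page furniture rather than pages in their own right.
ASSET_EXTENSIONS = ('.png', '.jpg', '.gif', '.ico', '.css', '.js')
PATH_PATTERN = re.compile(r'"(?:GET|POST) (\S+)')

hits = 0
page_views = 0

with open('access.log') as log:
    for line in log:
        match = PATH_PATTERN.search(line)
        if not match:
            continue
        hits += 1                              # every file requested counts as a 'hit'
        path = match.group(1).split('?')[0].lower()
        if not path.endswith(ASSET_EXTENSIONS):
            page_views += 1                    # only non-asset requests count as page views

print(f'{hits} hits, {page_views} page views')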

So, if you are going to count page views, do you then simply use the numbers the logs give you?

Robots

One of the most difficult problems with using logs is that they count bots and crawlers. These may access your site hundreds or thousands of times in a month. They are performing a useful role, crawling and gathering information that usually has a genuine use, but they inflate your page views, sometimes enormously. So, if someone tells you they have 10,000 page views a month, does this count all of the bots that access the pages? Should it? It may be that human use of the site is more like 2,000 page views per month.
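A simple first pass at spotting this kind of inflation is to count requests per IP address and look at the top of the list; anything in the thousands deserves a closer look. A minimal sketch, again assuming a combined-format log called access.log:

from collections import Counter

requests_per_ip = Counter()

with open('access.log') as log:
    for line in log:
        ip = line.split(' ', 1)[0]    # the IP address is the first field in the line
        requests_per_ip[ip] += 1

# The heaviest requesters are the ones worth investigating as possible bots.
for ip, count in requests_per_ip.most_common(20):
    print(f'{count:8d}  {ip}')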

Identifying and excluding robot accesses accurately and consistently throughout every reporting period is a frustrating and resource-intensive task. Some of us may be lucky enough to have the expertise and resources to exclude robots as part of an automated process (more on that with GA), but for many of us, it is a process that requires regular review. If you see an IP address that has accessed thousands of pages, then you may be suspicious. Investigation may prove that it is a robot or crawler, or may simply leave it under suspicion. We recently investigated one particular IP address that gave high numbers of accesses. We used the ‘Project Honey Pot’ service to check it out. The service reported:

“This IP address has been seen by at least one Honey Pot. However, none of its visits have resulted in any bad events yet. It’s possible that this IP is just a harmless web spider or Internet user.”

The language used here shows that even a major initiative to identify dodgy IP addresses can find it hard to assess each one as they come and go with alarming speed. This project asks for community feedback in order to continually update the knowledge base.

We also checked out another individual IP address that showed thousands of accesses:

“The Project Honey Pot system has detected behavior from the IP address consistent with that of a rule breaker. Below we’ve reported some other data associated with this IP. This interrelated data helps map spammers’ networks and aids in law enforcement efforts.”

We found that this IP address is associated with a crawler called ‘megaindex.com/crawler’. We could choose to exclude this crawler in future. The trouble is that this is one of many. Very many. If you get one IP address that shows a huge number of accesses, then you might think it’s a bot, and worth investigating. But we’ve found bots that access our site only 20 or 30 times a month. How do you identify these? Bots change constantly, with new ones appearing every day, and these may not be listed by services such as Project Honey Pot. We had one example of a bot that accessed the Hub 49,459 times in one month, and zero times the next month. We looked at our stats for one month and found three bots that we had not yet identified – MegaIndex, XoviBot and DotBot. The figures for these bots added up to about 120,000 page views in just one month.
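Once a crawler has been identified, its user agent string usually gives you something to filter on. A hedged sketch of this kind of exclusion, assuming a combined-format log and a hand-maintained list of user agent substrings that includes the three crawlers named above; any such list will always be incomplete, which is rather the point:

# User agent substrings for crawlers we have chosen to exclude.
# This list is illustrative only and will always lag behind reality.
KNOWN_BOTS = ('megaindex', 'xovibot', 'dotbot', 'googlebot', 'bingbot')

def is_known_bot(line):
    """Crude check: the user agent is the last quoted field on the line."""
    parts = line.rsplit('"', 2)
    agent = parts[1].lower() if len(parts) == 3 else ''
    return any(bot in agent for bot in KNOWN_BOTS)

kept = 0
excluded = 0
with open('access.log') as log:
    for line in log:
        if is_known_bot(line):
            excluded += 1
        else:
            kept += 1

print(f'{excluded} requests from known crawlers, {kept} remaining')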

404: Page Not Found

The standard web server HTTP response if a page does not exist is the infamous ‘404’. Most websites will typically generate a “404 Not Found” web page. Should these requests be taken out of your processed stats? It can be argued that they are genuine requests in terms of service use, as they do show activity and user intent, even if they do not result in a content page.

500: Server Error

The standard HTTP response if there has been a system problem of some kind is the ‘500’ Server Error. As with the ‘404’ page, this may be genuine human activity, even if it does not lead to the user finding a content page. Should these requests be removed before you present your stats?
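Either way, it helps to know how big the effect would be before deciding whether to strip these out. A small sketch that tallies log lines by status code, again assuming the combined format and a log called access.log:

import re
from collections import Counter

# The status code sits between the quoted request line and the response size.
STATUS_PATTERN = re.compile(r'" (\d{3}) ')

status_counts = Counter()
with open('access.log') as log:
    for line in log:
        match = STATUS_PATTERN.search(line)
        if match:
            status_counts[match.group(1)] += 1

# Shows how many requests were 200s, 404s, 500s and so on.
for status, count in sorted(status_counts.items()):
    print(f'{status}: {count}')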

Other formats

You may also have text pages (.txt), XML pages (.xml) and PDFs (.pdf). Should these be included or not? If they show high use, is that a sign of robots? It may be that people genuinely want to access them.
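One way to inform that decision is to break requests down by file extension and see where the volume actually is, and whether it looks like human use. A rough sketch along the same lines as the earlier ones:

import re
from collections import Counter
from os.path import splitext

PATH_PATTERN = re.compile(r'"(?:GET|POST) (\S+)')

requests_by_extension = Counter()
with open('access.log') as log:
    for line in log:
        match = PATH_PATTERN.search(line)
        if match:
            path = match.group(1).split('?')[0]
            extension = splitext(path)[1].lower() or '(none)'
            requests_by_extension[extension] += 1

# Unusually high counts for .xml or .txt may point at crawlers rather than people.
for extension, count in requests_by_extension.most_common():
    print(f'{extension}: {count}')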

Google Analytics and Bots

As far as we can tell, GA appears to do a good job of not including bots by default, presumably because many bots do not run the GA tracking code that reports the page view to GA. We haven’t proved this, but our investigations do seem to bear it out. Therefore, you are likely to find that your logs show higher page accesses than your GA stats, and as a bot can really pummel your site, the differences can be huge. Interestingly, GA also now provides the option to enable bot filtering, but we haven’t found much evidence of GA logging our bot accesses.

But can GA be relied upon? We looked in detail at some of the accesses in our logs and compared them with GA. We found one IP address that showed high use but appeared to be genuine: the user agents looked like they represented real human use, and the pattern of searching and pages accessed also looked convincing. From this IP address we found one example of an Archives Hub description page with two accesses in the log: gb015-williamwagstaffe. The accesses appeared to come from a standard browser (Chrome). We looked at several other pages accessed from this IP address. There was no evidence to suggest these accesses were bots or otherwise not genuine, but they do not appear in the GA stats.
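One way to carry out this sort of spot check is to pull every log line for a particular page out of the raw logs and compare what you find (dates, IP addresses, user agents) with what GA reports for the same page. A minimal sketch, using the description page mentioned above; the log file name is an assumption:

# Pull out every access to one description page so it can be compared,
# line by line, with what Google Analytics reports for the same URL.
TARGET_PATH = '/data/gb015-williamwagstaffe'

with open('access.log') as log:
    for line in log:
        if ' ' + TARGET_PATH + ' ' in line or ' ' + TARGET_PATH + '?' in line:
            ip = line.split(' ', 1)[0]
            timestamp = line.split('[', 1)[1].split(']', 1)[0]
            print(ip, timestamp, sep='  ')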

Why might GA exclude some accesses? There could be several reasons:

  • GA uses JavaScript tracking code, and it will not know about the activity if the JavaScript does not run when the page is requested, even when the user agent in the log looks like a legitimate browser
  • The requester may be using an ad blocker, which can also block calls to GA
  • The tracking call back to GA may have failed because of network issues
  • It may be that GA purposely excludes an IP address because it is believed to be a bot
  • It may not be a genuine browser at all, i.e. a bot, script or some other requesting agent that does not run the GA tracking code

Dynamic single page applications

Modern systems increasingly use HTML5 and Ajax to load content dynamically. Whereas traditional systems load the analytics tracker on each page load, these ‘single page applications’ require a different approach in order to track activity. This means using the newer ‘Google Universal Analytics’ and doing a bit of technical work, which is not necessarily something we all have the resources and expertise to do. If that work is not done, dynamically loaded pages are simply not tracked, and your page views will appear to go down.

Conclusions

Web statistics are not straightforward. Google Analytics may be extremely useful, and is likely to be reasonably accurate, but it is worth understanding the pitfalls of relying on it completely. Our GA stats fell off a rather steep cliff a few years ago, and eventually we realised that the .xml and .txt pages had started being excluded. This was not something we had control over, and that is one of the downsides of using third party software – you don’t know exactly how they do what they do and you don’t have complete control.

A recent study, How Many Users Block Google Analytics, by Jason Packer of Quantable, suggests that GA may often be blocked at the same time as ads, using one of the increasing number of ad-blocking tools, and the effect could be significant. He ran a small survey of about 2,400 users of a fairly niche site, but found that 8.4% blocked GA, which is a substantial percentage.

Remember that statistics for ‘hits’ or ‘page views’ don’t mean so much by themselves – you need to understand exactly what is being measured. Are bots included? Are 404s included?

Stats are increasingly being used to show value, but we do this at our peril. Whilst they are important, they are open to interpretation and there are many variables that mean comparing different sites through their access stats is going to be problematic.