Small ISV Marketing

Wednesday 26 August 2009

Moved to another location

This blog moved back to the main Céondo Ltd blog, within its own section. The content will be slowly reimported into the main blog. You can subscribe to the feed here.

It makes more sense to have everything in one place.

Monday 10 August 2009

Real time robot or crawler detection based on user agent

As explained in my attempt to track unique visitors on my website, I analyze the access to the robots.txt file, but I also do a simple analysis of the user agent string. The following function determines whether a visitor is a bot:

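    function isBot($user_agent)
    {
        // Substrings that mark a bot when found anywhere in the
        // user agent string (case-insensitive match).
        static $bots = array('robot', 'checker', 'crawl', 'discovery',
                             'hunter', 'scanner', 'spider', 'sucker', 'larbin',
                             'slurp', 'libwww', 'lwp', 'yandex', 'netcraft',
                             'wget');
        // Patterns: "bot" next to a separator, or a bare "Mozilla/x.0".
        static $pbots = array('/bot[\s_+:,\.\;\/\\\-]/i',
                              '/[\s_+:,\.\;\/\\\-]bot/i',
                              '/^\s*Mozilla\/\d\.0\s*$/i');
        foreach ($bots as $r) {
            if (false !== stristr($user_agent, $r)) {
                return true;
            }
        }
        foreach ($pbots as $p) {
            if (preg_match($p, $user_agent)) {
                return true;
            }
        }
        // Browsers put platform details in parentheses; a user agent
        // without any "(" is treated as a bot.
        if (false === strpos($user_agent, '(')) {
            return true;
        }
        return false;
    }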

It is working pretty well at the moment, and I am improving it regularly based on my logs. A more rigorous approach based on publicly available bot name data might be even better.

Friday 7 August 2009

Split testing, A/B testing and friends

Now that I have a fairly robust way to track the unique visitors of my website, I need to explore a way to use it. As said earlier, I want to perform some split testing.

What is split or A/B testing?

In short, you randomly split your user base into 2 or more groups and serve each group a given version of your page or website. You also define some conversion goals, for example accessing a given page, and then you track which group has the best conversion rate.

If you want to learn more, Jesse Farmer has a good introduction to A/B testing.

How to implement A/B testing?

You will need to define:

  • a test case;
  • and the corresponding conversion goal of the test;
  • for each test, some alternatives;
  • then you will need to log which alternative or treatment each visitor got;
  • and log for each user the conversion success.

Of course, you need to be able to track the individual users accessing your website and as such you need a good unique user tracking method.

A bit more detail on what to store

The test case should store:

  • name and description;
  • the percentage of visitors to test;
  • status (running or not).

The conversion goal:

  • name;
  • URL of the goal or regex to match the goal;
  • a reference to the test.

The alternatives:

  • name;
  • the slot (see treatment below);
  • the alternative text in HTML or what is needed;
  • a reference to the test.

For the treatments, you need to think a bit further than A/B testing and go into multivariable testing. Instead of testing one alternative against another (A or B), multivariable testing tests several alternatives on the same page or website at the same time. It is basically an extended version of the A/B testing method: you can consider it a huge A/B test whose alternatives are all the combinations of elements.

Imagine you have a webpage with a banner and a title (2 slots, one banner and one title) and you want to test 2 different banners B1 and B2 and 3 different titles T1 to T3. This means you have 6 possible treatments: B1T1, B1T2, B1T3, B2T1, B2T2 and B2T3.

You could avoid defining the slots by directly creating the 6 alternatives, but that would take some manual work, and you would lose the ability to do a bit of statistical kung fu and take a shortcut on the number of tests to run to get meaningful results.

So, a treatment should store:

  • a reference to the test;
  • which alternatives it includes.
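
As an illustration, the treatments can be generated from the slots instead of being written by hand. This is only a minimal sketch: the function name buildTreatments and the array layout (slot name mapped to a list of alternative names) are assumptions made for the example, not part of any existing code.

```php
<?php
// Hypothetical helper: build every treatment from the alternatives of each slot.
// $slots maps a slot name to the list of alternative names for that slot.
function buildTreatments(array $slots): array
{
    // Start with one empty treatment and extend it slot by slot.
    $treatments = array(array());
    foreach ($slots as $slot => $alternatives) {
        $next = array();
        foreach ($treatments as $partial) {
            foreach ($alternatives as $alternative) {
                // Each partial treatment gets one copy per alternative of this slot.
                $next[] = $partial + array($slot => $alternative);
            }
        }
        $treatments = $next;
    }
    return $treatments;
}

// The banner/title example: 2 banners x 3 titles give 6 treatments.
$treatments = buildTreatments(array(
    'banner' => array('B1', 'B2'),
    'title'  => array('T1', 'T2', 'T3'),
));
```

Each returned treatment is an array such as ('banner' => 'B1', 'title' => 'T1'), which is exactly the "which alternatives it includes" part of the treatment record.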

Now you will need to deliver the treatment to the users and log which treatment has been given to whom and what was the outcome.

The treatment log should store:

  • a reference to the test;
  • the delivered treatment;
  • which user got it.

The conversion log should store:

  • a reference to the test;
  • a reference to the goal;
  • and a reference to the visitor.

This structure allows us to track several goals at the same time, which can save time.
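
To make the goal matching concrete, here is a small sketch of how a requested path could be checked against several goals at once. The goal layout (a 'name' plus either a 'url' or a 'regex' key) and the function name matchedGoals are assumptions for the example; preg_match is the standard PHP function.

```php
<?php
// Sketch: return the names of all goals matched by the requested path.
// Each goal is an array with 'name' and either 'url' (exact match)
// or 'regex' (a PCRE pattern); this layout is hypothetical.
function matchedGoals(array $goals, string $path): array
{
    $matched = array();
    foreach ($goals as $goal) {
        if (isset($goal['regex'])) {
            $hit = (preg_match($goal['regex'], $path) === 1);
        } else {
            $hit = ($goal['url'] === $path);
        }
        if ($hit) {
            $matched[] = $goal['name'];
        }
    }
    return $matched;
}

// Two goals tracked at the same time for one test.
$goals = array(
    array('name' => 'download', 'url' => '/download/'),
    array('name' => 'read-doc', 'regex' => '#^/doc#'),
);
$hits = matchedGoals($goals, '/doc.html');
```

For every name returned, one row (test, goal, visitor) would be added to the conversion log.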

The A/B testing base workflow

The workflow is relatively simple. You basically do the following:

  • If the visitor is new for this test, decide randomly whether to test them, based on the percentage of visitors to test.
  • If you do not test them, just send the base case and log it as "base treatment".
  • If you need to test them, randomly select a treatment, log it in the treatment log and deliver it.
  • If the visitor is not new for this test, send them the already assigned treatment.

On each page, check whether an active goal is matched; if so, log the conversion success for this visitor.

What is important is to never send 2 different treatments to a given user.
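
The workflow above can be sketched as follows. It is a simplified illustration, not a finished implementation: the array $log stands in for the treatment log table, the function name assignTreatment is hypothetical, and the list of treatments is assumed to be non-empty. mt_rand and array_rand are the standard PHP functions.

```php
<?php
// Sketch: $log stands in for the treatment log (visitor id => treatment name).
function assignTreatment(array &$log, $visitorId, array $treatments, int $percentToTest): string
{
    // A returning visitor always gets the treatment already logged,
    // so a given user never sees 2 different treatments.
    if (isset($log[$visitorId])) {
        return $log[$visitorId];
    }
    // New visitor: decide randomly whether they take part in the test.
    if (mt_rand(1, 100) > $percentToTest) {
        $treatment = 'base treatment';
    } else {
        // Pick one of the treatments at random (list assumed non-empty).
        $treatment = $treatments[array_rand($treatments)];
    }
    // Log the choice before delivering it.
    $log[$visitorId] = $treatment;
    return $treatment;
}
```

Logging the base case as well is what keeps an untested visitor from being drawn into the test on a later visit.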

What to do next?

Implement this to run some fun tests on the homepage of indefero.net and report the results to you.

Comparing Google Analytics and server-side unique visitor tracking

So, now that I have a full day of records, I can start to get some statistics from my unique visitor tracking test. To get the number of unique visitors who came to the website yesterday, I just run:

SELECT COUNT(*) FROM
   (SELECT DISTINCT "user" FROM indefero_idfa_tracker_logs
    LEFT JOIN indefero_idfa_tracker_users
    ON indefero_idfa_tracker_users.id="user"
    WHERE bot IS FALSE
       AND DATE(indefero_idfa_tracker_logs.creation_dtime) = CURRENT_DATE - 1
    GROUP BY "user")
   AS foo;

This gives me 151 unique visitors. A quick check of the indefero_idfa_tracker_users table revealed 2 bots, so 149 unique visitors. Google Analytics, for its part, reports 120 absolute unique visitors, a difference of about 20%.

Note that the tracking is running on indefero.net, a website targeted at geeks. For this demographic group I think a 20% difference is not that big; it means that only about 20% of the visitors are blocking the GA tracker code. Anyway, I am really happy with the results, which show that the tracking is working well. My backend is PostgreSQL; with MySQL you may need to adapt the date operation in the query.

Bonus: Replacing bot IS FALSE with bot IS TRUE gives me 60 bots and crawlers.

Thursday 6 August 2009

First results of unique visitor tracking, the bots and crawlers are here

So the unique visitor tracking test is running. At the moment of writing, I have 159 unique visitors in my visitor table. From the excerpt of the results shown below, it is clear that I need to flag the bots and crawlers and exclude them from the page tracking.

Part of the visitor table

id | User agent
82 | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWeb [...]
83 | Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/530.5 [...]
84 | msnbot/2.0b (+http://search.msn.com/msnbot.htm)
85 | DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; [...]
86 | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.2) Gec[...]

Part of the log table

id  | Visitor | Page
341 |      83 | /  
342 |      83 | /tour.html
343 |      56 | /refund/   
344 |      56 | /privacy/   
345 |      84 | /robots.txt  
346 |      84 | /             
347 |      85 | /robots.txt    
348 |      85 | /
349 |      84 | /refund/   

The good thing is that the robots and crawlers are good Internet citizens: as you can see for the MSN bot with the id 84, they always request the robots.txt file first. This means that one can directly flag a new visitor as a bot if its first action is to grab the robots.txt file.
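
A minimal sketch of that rule, assuming the visitor table is represented by an array keyed by visitor id; the field names 'bot' and 'hits' and the function name recordHit are made up for the example:

```php
<?php
// Sketch: $visitors stands in for the visitor table, keyed by visitor id.
// 'bot' and 'hits' are hypothetical field names.
function recordHit(array &$visitors, $visitorId, string $path): void
{
    if (!isset($visitors[$visitorId])) {
        // First request of this visitor: flag it as a bot if it starts
        // by fetching /robots.txt.
        $visitors[$visitorId] = array(
            'bot'  => ($path === '/robots.txt'),
            'hits' => 0,
        );
    }
    // Later requests never change the flag set on the first one.
    $visitors[$visitorId]['hits']++;
}
```

Replaying the log excerpt above, visitor 84 is flagged on its first hit and stays flagged for / and /refund/, while visitor 83 is not.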

Now, this will kick out most of the bots and crawlers, but not this one:

70 | Mozilla/5.0

A very minimal user agent string.

303 | 70 | /doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]
304 | 70 | /
305 | 70 | //?_SERVER[DOCUMENT_ROOT]=http://www.[...]
306 | 70 | /doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]
307 | 70 | /  
308 | 70 | /doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]
309 | 70 | //?_SERVER[DOCUMENT_ROOT]=http://www.[...]
310 | 70 | /

And it is looking to trash my site. I am already not logging the requests without a user agent string, but it looks like I will need to use the heuristics of AWStats to mark more of the visitors as bots.

What to do next?

  • Add a field in the visitor table to mark a visitor as bot.
  • Mark a visitor doing the first request against /robots.txt as crawler/bot.
  • Do not log the requests of the bots.
  • Merge the AWStats robots definition as a simpler regex/substring matching to catch the robots.
  • Add small heuristics for the stupid security scanners. One could perform a small check on the request string to mark them and drop the corresponding logs.
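
For the last point, the check on the request string could look like the sketch below. The two patterns are assumptions derived only from the requests of visitor 70 shown above (an attempt to set a PHP superglobal, and a remote URL passed as a value); a real list would need to grow from the logs, and the second pattern can also flag legitimate parameters that carry a URL.

```php
<?php
// Sketch of a request string heuristic; the function name is hypothetical.
function looksLikeScanner(string $requestUri): bool
{
    $pos = strpos($requestUri, '?');
    if ($pos === false) {
        // No query string, nothing to inspect.
        return false;
    }
    $query = urldecode((string) substr($requestUri, $pos + 1));
    // Parameter named like a PHP superglobal, e.g. _SERVER[DOCUMENT_ROOT]=...
    if (preg_match('/(^|&)(_SERVER|_GET|_POST|_COOKIE|_REQUEST|GLOBALS)\[/i', $query) === 1) {
        return true;
    }
    // Remote URL passed as a parameter value (typical remote file inclusion probe).
    return preg_match('#=(https?|ftp)://#i', $query) === 1;
}
```

A visitor with at least one such request can then be marked as a bot and its logs dropped.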

I am going to work on that this afternoon and will report the results to you.

Wednesday 5 August 2009

First approach to unique visitor tracking

In my previous post, I wrote about unique user session tracking. Here is what I ended up creating to implement it in practice. This approach is being tested by tracking the unique visitors on www.indefero.net. I will then cross-check the results with the Google Analytics data of the account to assess the quality of the idea.

Database storage

The storage is composed of 2 tables, one for the visitors and one for the logs. The visitor table is needed as the goal is to track the unique visitors in real time. To reduce the lookups in this visitor table, information is cached using Memcached.

The visitor table stores:

  • IP address;
  • User agent;
  • Cookie value;
  • Creation time stamp;
  • Last seen date, updated at most once every 30 minutes.

The log table stores:

  • visitor (foreign key to the visitor table);
  • page seen;
  • time stamp.

Logging Procedure

To find a visitor in the visitor table, I first search by cookie and, if no cookie is available, by user agent/IP address combination. The real trick is the handling of the missing cookie. In my case, I log just before sending the response, which means that for a new visitor or a visitor without a cookie, I already have a new cookie. When checking for the visitor in the table, if the user agent/IP matches but not the cookie, I update the cookie in the table, because I have no idea whether the visitor will now accept the cookie or not. This could be a performance problem.
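
The lookup described above can be sketched as follows. This is a simplified illustration: the array $visitors stands in for the visitor table (a sequential list of rows), and the function name findVisitor and the field names are assumptions for the example.

```php
<?php
// Sketch: $visitors stands in for the visitor table (sequential list of rows).
// $cookie is the cookie received with the request (null if none),
// $newCookie is the cookie about to be sent with the response.
function findVisitor(array &$visitors, ?string $cookie, string $ip, string $userAgent, string $newCookie): int
{
    // 1. Search by cookie when the request carries one.
    if ($cookie !== null) {
        foreach ($visitors as $id => $row) {
            if ($row['cookie'] === $cookie) {
                return $id;
            }
        }
    }
    // 2. Fall back on the user agent/IP address combination.
    foreach ($visitors as $id => $row) {
        if ($row['ip'] === $ip && $row['user_agent'] === $userAgent) {
            // The cookie did not match: store the one being sent now,
            // in case the visitor accepts it this time.
            $visitors[$id]['cookie'] = $newCookie;
            return $id;
        }
    }
    // 3. Unknown visitor: create a new row with the new cookie.
    $id = count($visitors);
    $visitors[$id] = array('ip' => $ip, 'user_agent' => $userAgent, 'cookie' => $newCookie);
    return $id;
}
```

The write in step 2 happens on every request of a visitor who refuses cookies, which is where the possible performance problem comes from.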

Basically, I first perform a cookie check and then fall back on the user agent/IP address combination. This is running at the moment on indefero.net (only the presentation website, not the hosted forges) and I will compare the results with the Google Analytics results in 24 or 48h. What is already better than GA is that I can see the bots. Maybe I should add a bot flag in the visitor table to easily exclude them when doing reports.

How to track unique user sessions

Goal of the day (or maybe months): 300% increase of my conversion rates.

How to do that: Split testing.

What is needed: Track the unique user sessions of the website in real time.

So, how do you track the unique visitors on your website? I must say, it looks like black magic. I took the time to read the code of AWStats but was not able to understand it: my Perl fluency is far in the past, and the code is written entirely with speed in mind, not ease of understanding.

The Wikipedia page on web analytics provides this information:

[A unique user is] an IP address plus a further identifier. Sites may use User Agent, Cookie and/or Registration ID.

Good, so, it means that if I want to track my users, I need to use the IP address (easy), a user agent (easy), a cookie (not so easy) or a registration id (not possible in my case).

Why is it not easy to track with a cookie?

Cookies are optional. With Firefox, I have an extension to disable all cookies except for the websites I trust.

So, if you consider that you need a distinct pair (cookie, ip) to have a unique user, then each page I access on your website will count as a new unique user.

A possible solution not tested yet

Yes, I need to test it, and the way to do so is to implement it and compare with what Google Analytics gives me.

A unique user session is a combination of:

  • a unique IP address
  • a unique user agent
  • an optional unique cookie
  • all that active within 30 minutes

This approach means that if I do not have a unique cookie and a set of users comes from the same connection with the same browser, they will be counted as a single unique user.

Is it a problem? Not really. Why? Because the goal is to perform split testing, so the goal is rather to have a lower bound on the number of unique users and to be able to mark at least 50% of them for the split test. So as long as I can get a good fraction of the users with the cookie, I will be happy.

Here are more ideas to explore the tracking without cookies.

Implementation

I am a PHP shop, but you can do it in any language. What you need is simply a database and a fast in-memory storage (APC or memcached).

The fast memory storage avoids hitting the database at each request, and the database of course provides a bit of persistence. The memory storage expires each value after your desired session time (30 minutes), which automatically takes care of the active session length handling.

The workflow is as follows for a visit without a cookie:

  1. User accesses the website for the first time (or without cookie).
  2. Check the combination of IP + user agent in the memory store.
    1. If available, update the last seen time stamp and try to cookie it.
    2. If not available, add the IP + user agent pair in the memory store, the database and cookie it.

For a user with a cookie:

  1. Check the combination of cookie + IP + user agent in the memory store.
  2. Update the last seen value.
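
The memory store side of both workflows can be sketched with a tiny in-memory stand-in for APC or memcached. The class SessionStore and the function sessionKey are hypothetical names; the database write and the cookie handling are left out, and the current time is passed in explicitly so the expiry is easy to follow.

```php
<?php
// Minimal in-memory stand-in for APC or memcached: keys expire after $ttl seconds.
class SessionStore
{
    private $ttl;
    private $expiry = array();

    public function __construct(int $ttl = 1800)
    {
        $this->ttl = $ttl;
    }

    // Returns true if the key belongs to an active session, false if the
    // session is new or has expired. In both cases the expiry is refreshed,
    // which plays the role of the "last seen" update.
    public function touch(string $key, int $now): bool
    {
        $active = isset($this->expiry[$key]) && $this->expiry[$key] > $now;
        $this->expiry[$key] = $now + $this->ttl;
        return $active;
    }
}

// Key for the store: IP + user agent, plus the cookie when there is one.
function sessionKey(string $ip, string $userAgent, ?string $cookie = null): string
{
    return sha1(($cookie === null ? '' : $cookie) . '|' . $ip . '|' . $userAgent);
}
```

When touch() returns false, the visit starts a new session: this is the moment to write to the database and to try to set the cookie.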

Speed consideration

The tracking must be performed in real time. This is why it is not possible to use the referrer information to follow the path of the user and to distinguish the users accessing the website with the same IP/agent. Anyway, it looks like no single solution will be optimal; what is needed is something like an adaptive algorithm that gives a probability of "uniqueness" for a hit based on a combination of methods.