Tuesday, October 16, 2007

Heat Mapping

I've recently had the pleasure of uncovering some new data visualisation (visualization for you Americans) capabilities, in particular heat mapping: plotting a measure that's a function of two input variables.

Generating eye-catching representations of information is crucial in the corporate world, and yet for the most part business data visualisation is very poor. Back in the 90s I was busy with research for a physics PhD and I lived by gnuplot and a Windows package called Origin. In transitioning into the business world I also transitioned into Excel. I've enjoyed using Excel, and I'm a great fan of Excel 2007, especially the data mining tools that Microsoft now provides with it. The problem with the graphs is that everyone else at work generates graphs that look, well, exactly the same. So I thought I'd take a look at what else is now available.

First off, I discovered that gnuplot and Origin still exist, and their capabilities have advanced far beyond what I used when I produced a thesis on a 386 Linux PC with... was it 120MB of disk? But then I discovered Python, the scipy (scientific Python) and numpy (numeric Python) libraries, and the MatPlotLib library. These looked especially useful because of the image mapping capability, which could be used to generate a heat map.

When you work with a range of people with different backgrounds in a corporate environment, you're constantly trying to re-represent information in different ways to make a connection with your audience. For some people an Excel bar chart works great, others prefer scatter plots, and others will want error bars. So you want a range of graphing tools available. Excel is good, especially Excel 2007, but there's still a range of visualisations that it can't perform, especially image mapping or heat mapping and some 3D plot types.

MatPlotLib can help to fill some of that void, particularly with image mapping. Check the screenshots to see a range of examples of image (heat) maps and polar plots mixed in with more conventional 2D plot types.

In one example I looked at the pattern of customer usage through our Internet channel as a function of customer age and their total relationship with the company. To do this with MatPlotLib I first collected the data from our database environment. I then used Excel to manipulate the data into an array format (the same format as an Excel surface plot) and saved it to a file. Next I loaded it into a Python array and used the imshow() command to generate a heat map. I regenerated the graph over successive half-year intervals, keeping the colour range set to run over a constant interval, then combined the images into a movie with Windows Movie Maker (you can also do this with command line tools such as mencoder, which is part of the open source MPlayer suite).
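The steps above can be sketched roughly as follows. This is only an illustration, not the script I actually ran: the function name render_frames, the frame file names, the axis labels and the synthetic grids are all made up for the example. With real data, each grid would be loaded from the file saved out of Excel (for instance with numpy's loadtxt). The important part is that vmin and vmax are fixed once and passed to every imshow() call, so the colours mean the same thing in every frame.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw straight to files; no window needed
import matplotlib.pyplot as plt
import numpy as np


def render_frames(grids, out_dir, vmin, vmax):
    """Save one heat-map PNG per period, all sharing the same colour range.

    `grids` is a list of (label, 2-D array) pairs. The name and signature of
    this helper are hypothetical, chosen only for this sketch.
    """
    paths = []
    for index, (label, grid) in enumerate(grids):
        fig, ax = plt.subplots()
        image = ax.imshow(grid, origin="lower", aspect="auto",
                          cmap="hot", vmin=vmin, vmax=vmax)
        fig.colorbar(image, ax=ax, label="usage (made-up units)")
        ax.set_xlabel("age band")
        ax.set_ylabel("relationship band")
        ax.set_title(label)
        path = os.path.join(out_dir, "frame_%03d.png" % index)
        fig.savefig(path)
        plt.close(fig)  # free the figure before drawing the next one
        paths.append(path)
    return paths


# Synthetic stand-in for the half-yearly grids. With real data each grid
# would come from something like np.loadtxt("period.csv", delimiter=",").
rng = np.random.default_rng(0)
base = rng.random((8, 12))
grids = [("period %d" % n, base * (1 + 0.5 * n)) for n in range(4)]

# Fix the colour limits across all periods so the frames are comparable.
vmin = min(grid.min() for _, grid in grids)
vmax = max(grid.max() for _, grid in grids)

out_dir = tempfile.mkdtemp()
frame_paths = render_frames(grids, out_dir, vmin, vmax)
```

The numbered PNG files can then be dropped into whatever movie tool you prefer.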

The output of this exercise was a movie that showed customer activity through the Internet increasing over time across all age ranges but accentuated by the degree of total transactional relationship they had with the company.

One of the biggest problems you face when generating heat maps or image plots is getting a full range of data. If the value you want to plot is a function of x and y, and you're not actually taking measurements but relying instead on historical data, you'll probably end up with a lot of missing data points. The imshow() command can interpolate between neighbouring cells when it draws the image, which helps to smooth over small holes.
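If the holes are too big for the display smoothing to hide, one option is to deal with them in the array before plotting. The sketch below is an assumption on my part rather than anything imshow() does for you: the helper fill_holes is a made-up name, and it simply replaces each missing cell (stored as NaN) with the average of whatever neighbours it has. Alternatively, masking the missing cells with numpy's masked_invalid should leave them blank in the plot so the gaps stay visible.

```python
import numpy as np


def fill_holes(grid):
    """Replace each NaN with the mean of its non-NaN neighbours (3x3 window).

    Hypothetical helper for illustration. Cells with no valid neighbours
    are left as NaN.
    """
    filled = grid.copy()
    for r, c in zip(*np.where(np.isnan(grid))):
        window = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
        neighbours = window[~np.isnan(window)]
        if neighbours.size:
            filled[r, c] = neighbours.mean()
    return filled


# Tiny made-up grid with one missing cell in the centre.
grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])

filled = fill_holes(grid)          # centre becomes 5.0, the mean of the other eight
masked = np.ma.masked_invalid(grid)  # alternative: keep the hole, but mark it as missing
```

Either array can be handed to imshow() in place of the raw grid. Filling is a judgement call: it makes for a nicer picture, but it also invents values, so it's worth saying so when you present the result.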

The easiest way to get MatPlotLib on your PC or laptop is to install the standard Python distribution and then install the IPython shell over the top. IPython provides you with a scipy item in your programs list, which gives you a Python command prompt with all the necessary libraries pre-loaded.

If you're interested in other visualisation tools, I've since found a good background reference at IBM: http://www.ibm.com/developerworks/linux/library/l-datavistools/. It says it's for Linux, but in reality most of this software runs on Windows (in fact, if you look at the download numbers for some of the packages, the majority of downloads are for Windows).

2 comments:

Paddy3118 said...

It's good to read articles like this that remind you of what data visualisation is capable of. I guess if the data is bad, then the graph should equally look bad?

- Paddy.

Bohdan Szymanik said...

Good reference Paddy, it reminds me of the Fallacy Files (http://www.fallacyfiles.org/index.html)