Bohdan Szymanik

Thursday, April 19, 2007

Singleton DMX

If you've used DMX to generate predictions from a data mining model you will be familiar with the singleton query. This is where you obtain a prediction for a particular set of input attributes; either within a case table or in a nested table. There was an example of one of these in a previous post.

Now the form of that query was:

select prediction
from data mining model
prediction join
(select (select something as nested row
union select somethingelse as nested row etc) as nestedtable)
as t

Now, what I wanted to do was use openquery() to return all the nested rows, but I did not have an explicit case. How do you make it work?

After bugging the team on the www.sqlserverdatamining.com web site in the newsgroup section the answer was obvious. Just go ahead and create a dummy case. A sample query that works for one of my models follows.

SELECT
    Predict([Model_NaiveBayes].[Bucket]),
    PredictProbability([Model_NaiveBayes].[Bucket])
From [Model_NaiveBayes]
NATURAL PREDICTION JOIN
SHAPE { OPENQUERY(Test, 'SELECT 1 as CaseKey') }
APPEND (
    { OPENQUERY(Test,'
    select 1 as ForeignKey, Term
    from Terms
    cross apply Matches(''some long discourse containing many terms that I want to characterise'',''\b('' + Term + '')\b'')')
    } RELATE [CaseKey] TO [ForeignKey]
) AS [Msg Term Vectors]
AS T

In this case it uses the Matches TVF, which I described in a previous post, to identify the terms in the text.
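The fiddly part of the query above is the quoting: the inner query is itself a string literal passed to OPENQUERY, so every single quote inside it has to be doubled. A minimal sketch of a helper that does the doubling is shown here in Python; the function names are hypothetical and only the table, column, and function names (Terms, Term, Matches, Test) come from the query above.

```python
def sql_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_inner_query(message: str) -> str:
    """Build the nested-table query that runs the Matches TVF over the message."""
    prefix = sql_literal(r"\b(")
    suffix = sql_literal(r")\b")
    return (
        "select 1 as ForeignKey, Term\n"
        "from Terms\n"
        f"cross apply Matches({sql_literal(message)}, {prefix} + Term + {suffix})"
    )


def build_openquery(data_source: str, inner_sql: str) -> str:
    """Embed the inner SQL in OPENQUERY; quoting it again doubles the quotes once more."""
    return f"OPENQUERY({data_source}, {sql_literal(inner_sql)})"


if __name__ == "__main__":
    print(build_openquery("Test", "SELECT 1 as CaseKey"))
    print(build_openquery("Test", build_inner_query("it's a long discourse")))
```

Building the string in two steps mirrors the two levels of quoting, which makes it harder to miss a quote than when typing the nested literal by hand.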

Wednesday, April 18, 2007

Exposing the Regular Expression Match Collection to SQL Server as a Table-Value Function

For a recent pet project I've been attempting to create a text mining model in SQL Server 2005 to analyse incoming messages and automatically bucket them into one of a number of categories. This follows straight on from the text mining example contained in the SQL Server tutorials.

With the model implemented (using Decision Trees, Naive Bayes etc) it's easy to create a singleton query by hand that has the following form:

SELECT
[Model_NB].[Bucket],
TopCount(PredictHistogram([Bucket]), $AdjustedProbability, 3)
From
[Model_NB]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'some defining term' AS [Term]
UNION SELECT 'another identifying noun or phrase' AS [Term]) AS [Msg Term Vectors]) AS t

But I still needed to extract the identifying noun phrases that make up the terms. Given a dictionary of terms and a length of freeform text, how do you find all the term occurrences?

Using the SQL Server string functions is painful so I thought I'd try the Match Collection object in the CLR. To expose this you need to perform the following operations.

Firstly, enable CLR integration with

EXEC sp_configure 'clr enabled', 1
RECONFIGURE

Then create a SQL Server function in .NET to expose the MatchCollection, e.g.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections;
public partial class UserDefinedFunctions
{
    // Table-valued function: each Match in the returned collection becomes one row.
    [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "RowFiller", TableDefinition = "Match NVARCHAR(MAX)")]
    public static IEnumerable Matches(String text, String pattern)
    {
        MatchCollection mc = Regex.Matches(text, pattern);
        return mc;
    }

    // This gets called once each time the framework calls the iterator on the underlying MatchCollection.
    public static void RowFiller(object row, out string MatchTerm)
    {
        Match m1 = (Match)row;
        MatchTerm = m1.Value;
    }
};

Then deploy using Visual Studio, or if you want to do it manually try:

CREATE ASSEMBLY MatchesAssembly FROM 'c:\somewhere\some.dll' WITH PERMISSION_SET = SAFE

CREATE FUNCTION Matches(@text NVARCHAR(MAX), @term NVARCHAR(MAX))
RETURNS TABLE
(Matches NVARCHAR(MAX))
AS
EXTERNAL NAME MatchesAssembly.UserDefinedFunctions.Matches

And voila, you can do the following...

select *
from terms cross apply Matches(@text,'\b(' + terms.term + ')\b')

Where terms is a table of noun phrases you're searching for in the variable @text that contains the message text.
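To see what the cross apply is doing without a SQL Server instance to hand, the same logic can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, and Python's regex engine differs from .NET's in some details, although \b word boundaries behave the same way for simple terms like these.

```python
import re


def matches(text, pattern):
    """Return the value of every regex match, one entry per row the TVF would emit."""
    return [m.group(0) for m in re.finditer(pattern, text)]


def find_terms(text, terms):
    """Emulate 'terms cross apply Matches(...)': one (term, match) row per hit."""
    rows = []
    for term in terms:
        # Same pattern as the SQL: the term wrapped in word boundaries.
        pattern = r"\b(" + term + r")\b"
        for value in matches(text, pattern):
            rows.append((term, value))
    return rows


if __name__ == "__main__":
    text = "the cat sat on the mat with another cat"
    for row in find_terms(text, ["cat", "mat", "dog"]):
        print(row)
```

As in the SQL version, each term is dropped into the pattern as-is, so a term containing regex metacharacters is treated as a pattern rather than as literal text; wrapping the term in re.escape() would be one way to avoid that.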

Tuesday, April 10, 2007

Measuring how quickly we learn from our mistakes

For 10 years I've heard how you need to learn from your mistakes and it seems quite a reasonable mantra. But the problem I've got with learning from mistakes is this: how quickly do you learn?

Some people and some organisations seem to learn really quickly, others, more like myself, take a while longer, others take the whole of their existence.

So the question I've got is how do we measure the responsiveness of someone, or some organisation, to learning from their mistakes?

What metric could we use to tune performance?

Not sure, but maybe we can think something up.

Sunday, April 08, 2007

Whales off the Beach

Wonderful sight - whales off the beach, maybe 300m from shore on a beautiful still, sunny day. Not sure what type, but I don't think they were Orcas - they didn't seem large enough. The shame of it is that if it weren't for the cold I've had for the last few days, my plan was to be out kayaking this morning. Wouldn't that have been amazing!

Oh, perhaps a bigger shame, especially from the point of view of the whales, was the idiot out there going round and round in circles on a jetski. Confirms every opinion I've ever had of them.

Update: Talking to someone this morning I found out they were indeed Orcas. Now I really wish I'd been up to going out on the kayak.

Friday, April 06, 2007

The value of a demonstrable prototype

I've just been reading an interview by Ron Jacobs with Scott Guthrie; it's up on The Microsoft Architect Journal (http://msdn2.microsoft.com/en-us/library/bb266332.aspx).

What's caught my attention is the importance Scott put on having a demonstrable prototype of their ASP.Net technology early in the development. "There were three or four of us" and yet they managed to create one of the key product offerings of one of the world's largest companies. A truly remarkable outcome.

Now I'm sure that the team itself was extraordinary, but still I bet in many companies they would have been driven into the ground under the weight of the stakeholders. The more people involved, the more diluted the good ideas become and the greater effort needs to be put into selling ideas and managing the stakeholders. The processes companies typically put into place to manage new delivery put a stranglehold on true innovation.

'...That's one of the successful things that we did with ASP.NET. We said that we're
going to throw away every line of code we're going to write for the next couple
of months. Let's all agree on that. We're not going to say, "Oh let's take this
and adapt it; we can clean it up." No. We're going to throw it away. We're going
to "deltree" this subdirectory at some point, and that way we can be more
adventurous about trying new things. We don't have to worry about making sure
that everything's robust because it's going to be in the final version.

We actually did that for a few months and said, "We're done, delete it; let's start
over from scratch; now let's write the full production code and make sure we
bake in quality at the time." I think a lot of teams could benefit from that.
The hardest thing is making sure you delete the prototype code. Too often,
projects develop with "Well, it's kind of close." It's very difficult to start
with a prototype and make it robust. I'm a firm believer in starting with a
prototype phase and then deleting it.'

The typical corporate company approach would not have sustained this type of development.

It would be an incredibly gifted individual that could convince a steering committee to support expenditure on experimentation when few or no artifacts could be easily capitalised at the end of a 3 month development. Most companies want functional delivery early with a clear line of connection from end user requirements to final deliverable.

Scott talks specifically about the value of a prototype to help sell the idea.

"We certainly had to persuade a number of people along the way. One thing we
did early in the project was get running code in prototypes that we could show
people. Often when you're working on a project that's new or something that
hasn't been done before, it's easy to put together a bunch of PowerPoint slides
that sound good, but it's especially valuable to actually show code and walk
people through code that's running. Not only does the prototype prove that it's
real, but also you just learn a terrific amount by doing it."

Perhaps their greatest achievement was the successful marketing of the development from initial concept through to the current deliverables. The early prototyping and demonstrations were obviously a critical part of this achievement.

At the end of the day all of this makes great fodder for the creation of a myth list: the My Myth List. My first two myths are going to be:

End users know what they require
Product delivery is a linear process

Because it's apparent that people don't know what they want. It's the visualisation of what is possible that provides a basis for requirements. And product delivery is never linear.

I'm sure I'll think of more over the next few weeks.

Sunday, April 01, 2007

21C in the pool on April 1st

Global warming? Well it was a late start to summer...

Commotion at the beach

One of the Kapiti Island ferries was caught out this afternoon. I joined a small crowd to watch as the tractor that was supposed to haul it out of the water got stuck in the sea. The engine appeared to have failed and, in a scene reminiscent of the Little Digger, 3 tractors lined up in a row to try and tow the stranded tractor and ferry out, without success. They temporarily gave up on the ferry, hauled out the tractor before the tide came in further, then towed the ferry onto the shore to allow the passengers out. A bit of excitement while I was down at the playground with Lil!

Thursday, March 22, 2007

ARCast video up for Kiwibank Case Study

I've just found out that Ron Jacobs has posted a video interview of myself and two members of my team on his ARCast site.

The interview was based on the Kiwibank Case Study which we undertook last year with Microsoft. The interview takes a broad look at how Kiwibank has used technology to help it go from 0 to 12% of NZ's population in 5 years. I speak along with David Grahame and Sushil Kamanahalli. David is the client applications architect and Sushil is the service layer architect.

Special thanks go out to Mark Carroll for helping to organise this and to Ron Jacobs for the interview and the work to put it together.

Fashion at Government House!

What a fantastic night. I had the absolute pleasure of seeing my wife, Miriam Gibson, present her winter collection of women's fashion wear at a charity event held at Government House in Wellington on Tuesday night.

It was magnificent. The show was held in the Ballroom with the Governor General, His Excellency The Honourable Anand Satyanand, and his wife, Her Excellency Mrs Susan Satyanand, army staff in regalia, members of Rotary and the charitable organisation, Refugees as Survivors, and over 200 guests who had come to see the launch of the 2007 winter range and support a good cause.

And I was very, very, very impressed. Miriam, Victoria, Sue, Sarah, Veronica, and all the models - you all did a helluva job!

And then standing up there, following up the Governor General, the local Rotary head, and representatives of the charity with a speech on the podium, with microphones, photographers, and press present. Crikey - my recent speaking engagements pale in comparison.

The range was fantastic so I definitely recommend checking it out: head to the stores in Margaret Rd, Raumati and Hunter St, Wellington, or check out the web site to find out more (pictures from the event are promised over the next few days). Nothing for the guys, this is ladies only. And don't forget the charity - they're a very worthwhile cause in this country which has been the fortunate recipient of many refugees in the past, including the Polish half of my ancestry...

Saturday, March 17, 2007

Rod Drury on investment for IP

http://www.drury.net.nz/2007/03/17/building-intellectual-property/

Notice the graph that Rod has obtained of normalised patents per country. See how NZ sits at about 0.5 in 22nd place. Finland rides at the head on 4.5 and the OECD average sits at just under 2 which means it's closer to Finland than NZ.

This is a graphic representation of the underinvestment in R&D in NZ compared to other countries. This isn't about centralised R&D run out of the government, but about the failure of companies to invest in product development. Great thing that Rod pointed this measure out.

The Business, You and Me; Get It Together!

If you've read some of the previous posts you may have realised that I currently work in the IT area of a bank: the chief architect role at Kiwibank in fact (as those who attended the keynote at the recent Microsoft NZ Tech Briefings, or have read our Microsoft case study will know). Prior to Kiwibank I worked for a year at ENZA, then before that a few years at Deloitte Consulting, and prior to that I undertook a physics PhD from Victoria University in association with Industrial Research Ltd, a Crown Research Institute.

Each of these organisations has taught me a little bit more about how people work together and what makes us succeed in delivering. It has also highlighted the precarious and unappreciated position that the shared service line holds, especially the IT shared service line.

ENZA was an organisation that went, during the year that I was there, from an external market focussed apple and pear exporter with producer board status and mandated export control to a grower focussed commercial entity that retreated from the political environment of Wellington to the safety and security of the grower stronghold, Hastings. As a company it had two strategic directions ahead of it: either strive to become a global category specialist, based perhaps in one of the major trading hubs, or become a grower focussed organisation that showed its value at the farm gate. Retreating was certainly the less risky option and it was that path that led to the rationalisation with Turners and Growers in a 2003 merger.

At ENZA the (naive) question faced in 2000 was what part of the company represented the future of the business: the export facing arm or the grower facing arm. IT at this time was treated as a cost centre run under the finance group. And seeing as the organisation ran SAP it was certainly some cost.

In the middle of 2000, while working at ENZA, I happened to come across a chap at Turners and Growers who explained that they were running a home grown software suite that was at the end of its tether. Myself and two others actually went and visited them in Auckland and it was apparent that they had problems. The obvious thought to the three of us was: imagine combining the two organisations and taking advantage of the SAP implementation at ENZA. It would be a great asset, right?

Well, that is indeed what happened. Someone out there saw the synergy: did Tony Gibbs think about this I wonder? Whoever it was they certainly knew a thing or two about SAP and IT in general. I note that there's now a customer success story about SAP and Turners and Growers on the SAP web site.

Prior to that experience I was at Deloitte Consulting which presented me with opportunities to work in companies and organisations across government, the health sector, and telecommunications. Now for all the minor gripes that many of us had there at the time - long hours etc etc - one thing definitely stands out: the value of good people. I did work with some very good people and while often thrown in at the deep end we did ok. A small group went on to do especially well, witness Trademe and AMR. Being a consulting group we didn't have much of an IT function. Information Technology was a core attribute of our service line and overlaid across the group was a matrix model representing sector and service advocacy. I think it worked well.

In comparison many of the companies we worked for had well defined structures with strong vertical focus on product delivery. You'd walk into these organisations and there were barriers everywhere. Internal development was hardly ever undertaken. Individual business units would occasionally issue RFIs, RFPs or succumb to the salesmanship of a clever vendor. Work would always proceed on the basis of a long chain approach that ensured the people that understood what was possible never had a chance to really influence the development of new ideas in the organisation. Certainly not outside of the immediate business unit.

In these environments you'd always hear the catch phrase: "it's up to the business to decide", or often from the PMs/BAs, "we have to listen to the business", or the classic "the business wants...".

The depressing thing is that this is more often voiced by the staff of the IT department than the business units themselves. If people in an IT department don't think they're value contributing then they deserve to be treated as a cost centre and outsourced to the likes of EDS or IBM.

It's a personal mission of mine that my application delivery group does not come out with the same nonsense. It's the innovation that comes from those that know what's possible combined with the people that can advocate for a customer, and those that know the financial constraints and tools, and those that can market the products that creates value contribution in a company. Any organisation that forgets the value of the combined talent of all its resources deserves to lose market value.

Oh, and Industrial Research? A depressing environment of disillusioned scientists with ideas but no knowledge of how to commercialise them...

Thursday, March 15, 2007

Wellington Microsoft Tech Briefing

It was another great event yesterday and it's just fantastic to be seeing so many people. Wellington is my home town so there were plenty (plenty) of faces I recognised in the audience. A big thank you for the opportunity goes out to Mark Carroll, Rebecca, Sean, Dean, Carol and all the others. Being the second time through, the short speech I make in the keynote gave me a chance to think twice about the message I was trying to present.

And it's confirmed in my mind that the main message I want to get across to people is to actively think about opportunities in their organisations, experiment with technologies and tools, and work on marketing any ideas they may come up with. It's how to make things happen and you know, life is too short to be doing dumb stuff when you could be doing cool stuff.

As a great way to finish off the day I got to attend the Microsoft Architects Council meeting at the hotel. I like to attend these events as it's a good chance to catch up with people I don't see every day. We have an active group of people up and down the country that attend these events and the chance to explore ideas is never something to pass up.

Next week is the final Tech Briefing in Christchurch. I can't wait for this as I know by then I'll be wanting to tune the message once more!

Prioritisation: Apples and Oranges

Every company seems to share the ritual of the prioritisation session. It has a common format and a common process.

Each business head gets to voice their opinion on what's most important to them. These are dutifully collated into a master list and then a discussion takes place to rank one above the other based upon some number of criteria; typically financial, customer experience, and compliance. Finally the agreed list is circulated for action.

Its value is normally limited because importance is not a good measure for prioritisation.

Importance is a measure of emotional conviction. It is a broad term that can mean many things depending upon the subject. The importance of a programme of work is not the same thing as the importance of an immediate fix, or the importance of a process review, or the importance of addressing a particular risk. In each case the definition of the term, importance, differs and therefore it cannot be used for comparison. It is accurately a measure of emotional response, but I suspect little else.

What is the alternative? Perhaps it's better to ask what's the point.

The purpose of the prioritisation session is to allocate scarce resources. Scarcity can only be resolved through a process of trade-off (this is text book economics). What complicates the task in an organisation are the differing time requirements, resource specialisations, and dependency effects.

We have a limited ability to weigh up the combination of time, resource, specialisation and dependency factors to determine how limited resources can be applied to a range of competing tasks. Our minds have to make best guess estimates and the wider the scope the greater the problem (I bet someone out there can prove this is a power law expansion).

Which drives us to smaller delivery teams to reduce the scope of the problem.

So what should the prioritisation session be?

Perhaps to set areas of focus and define the criteria for prioritisation. I suspect little more.

Tuesday, March 13, 2007

Innovation in the Corporate Environment

This is a hot topic for me. I work in one of these environments and I'm involved in a fairly traditional (these days) role of enterprise architect: nominally responsible for the overall design of systems to ensure they meet business needs, and typically driven more from the perspective of policy and process than the introduction of new ideas. I'm afraid I'm not a very good enterprise architect.

Being on the back foot and not contributing to the ideas that form the basis of many of the commercial opportunities seems quite daft to me.

My aim in life is instead to communicate the opportunities of technology, or in fact, any thing that comes to mind actually. You know I went through university in a rather clueless manner and it's only now I see the possibilities of the methods taught to me at the time. There is just so much out there that can help give you an edge. (Wish I'd paid a bit more attention in the lecture rooms....)

Ensuring technology meets business needs is never going to see innovative solutions deployed; it's never going to see solutions applied to problems that aren't yet realised to exist. How often have we looked around and seen only in retrospect that we missed the ball completely (trust me, in my 40 years it's happened a helluva lot!)?

So, yes, innovation is a hot topic for me.

Today I read an article syndicated from some offshoot of The Economist called, innovatively enough, The Economist Newspaper, referring to the demise of traditional R&D and the rise of a new form of directed innovation concentrating on the D aspect. I'm not against this; a lot of the great ideas out there (that I've missed the boat on) have typically only been a couple of years ahead of everyone else's thinking. But the fact they were ahead proved a significant advantage.

The article began by looking at the output of Vannevar Bush, a gifted thinker in his own right, and an advisor to the Roosevelt and later administrations. It was Vannevar that spearheaded America's implementation of government and military funded R&D from the 1940s to the 1970s. "Industry is generally inhibited by preconceived goals, by its own clearly defined standards, and by the constant pressure of commercial necessity," he wrote in 1945. It still rings true today.

But the days of the big labs are gone. Bell Labs has fallen apart, IBM's research labs are far more highly directed now, and Microsoft Research nominally allows free rein, but then look at the narrow range of papers on their site. Where's all the Research and what can a smaller company do?

It seems to me that in the mid-size corporate environment (with a few hundred staff) there is one classic failing: the creation of the product delivery chain. You know the one. It starts with the customer on the street, then there's marketing, then BAs, then project teams, and at the end of the line, the implementors.

Nothing driven down such a long chain will be innovative. The people at the end of the line act out a Dilbertesque cartoon living in perpetual frustration. The customers only get what they ask for; and no more. The nimble, smart companies out there create their own new niches and the slow ones are left to play catch up.

I think the secret to this is to keep team sizes down and allow small teams to experiment with ideas ensuring that at an overview level there is a process of nurturing and selection. Allowing failure to occur has to be an integral part. "Please fail very quickly - so that you can try again" says Eric Schmidt from Google.

Ground breaking products and processes are always due to the conceptual insights of individuals. So it should be the task of every innovative organisation to provide a mechanism to foster the intellectual output of their staff.

Response Time Distributions, IIS Log Files, and the question of the Missing Events

Over the last year I've been involved in a number of investigations attempting to find bottlenecks in systems consisting of clients, web service hosts, and databases (usually containing application logic in addition to data). The details of each system's implementation are not especially important to this discussion because what I want to do here is just relate one of my recent experiences regarding measurement of response times. You might like to check if you get similar behaviour.

Firstly, let's just describe the typical situation I find myself looking at. It's very generic; I'm sure the same thing will apply to you. I usually have some client systems accessing a service layer hosted in IIS6 talking to a database server (usually with significant embedded application logic).

The Problem
The problem (or my lesson in this case) is how to interpret the numbers you get from the IIS log files on the web service hosts. These files give you HTTP request duration and the arrival time of the request. Now what happens when you naively plot a distribution of the duration (i.e. request execution time)? You might expect a nice symmetrical peak centred on some value, or you might get something such as the following.

I found that I was consistently getting this sort of shape across different types of requests. This isn't really all that unexpected. Each URI in the log file corresponds to a web service against which a number of web methods may be called. The web methods may have significantly different response times so the graph of the service call is really just a summation of all the individual web method calls. In fact we've now implemented duration timing on each individual web method call, and we get a much simplified distribution centred round one peak.

Anyhow, while I was looking into the double peak I decided to look at the arrival rate and compare that to the duration of the call. Theoretically you should get a Poisson distribution for random, uncorrelated events and on those occasions when there were many simultaneous arrivals you'd also expect the response time to slow down (although whether this was linear/non-linear is another question).

So, I looked for a period of time during which we have fairly constant activity and chose 2 hours in the middle of the day.

You can see that there's a nice distribution with the expected shape centred on 4 arrivals/second. Of course the IIS log files only record data when a request actually arrives. Looking at the bar graph you'd therefore naively expect about 300 one second intervals over the two hour period during which no requests arrived.
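If you want to run the same check on your own logs, the arithmetic is simple enough to sketch in Python. The sketch assumes arrival times have already been parsed from the log into seconds (the parsing is left out, and the function names are made up for the example). For a Poisson process with mean rate r arrivals per second, the expected number of empty one-second intervals in n seconds is n * exp(-r).

```python
import math
from collections import Counter


def arrivals_per_second(timestamps):
    """Count requests falling in each whole-second bucket (timestamps in seconds)."""
    return Counter(math.floor(t) for t in timestamps)


def empty_seconds(timestamps, start, end):
    """Count one-second intervals in [start, end) during which no request arrived."""
    counts = arrivals_per_second(timestamps)
    return sum(1 for second in range(start, end) if counts.get(second, 0) == 0)


def expected_empty_seconds(rate_per_second, n_seconds):
    """Expected number of empty seconds for a Poisson process: n * P(0 arrivals)."""
    return n_seconds * math.exp(-rate_per_second)


if __name__ == "__main__":
    # Toy data: seconds 1, 3 and 4 have no arrivals.
    print(empty_seconds([0.1, 0.5, 2.3, 5.9], 0, 6))
    # Two hours at an assumed mean of exactly 4 arrivals per second.
    print(round(expected_empty_seconds(4, 7200), 1))
```

The expectation is very sensitive to the mean rate: a mean of exactly 4 per second gives roughly 132 empty seconds in two hours, while a figure nearer 300 corresponds to a mean of about 3.2 per second. Either way, comparing the expected count with the number of seconds actually missing from the log is the interesting part.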

And this is where I got a surprising result.

So, it appears that the IIS HTTP arrival time is not accurate. I think that what I'm seeing here is that IIS is already queuing the requests up for processing - presumably because processing downstream is taking too long.

If you get anything like this, have seen it before, or have a bit more knowledge of what is going on I'd love to hear about it.