Bohdan Szymanik: 04/01/2007

Thursday, April 19, 2007

Singleton DMX

If you've used DMX to generate predictions from a data mining model you will be familiar with the singleton query. This is where you obtain a prediction for a particular set of input attributes; either within a case table or in a nested table. There was an example of one of these in a previous post.

Now the form of that query was:

select prediction
from data mining model
prediction join
(select (select something as nested row
union select somethingelse as nested row etc) as nestedtable)
as t

Now, what I wanted to do was using openquery() to return all the nested rows, but I did not have an explicit case. How do you make it work?

After bugging the team on the www.sqlserverdatamining.com web site in the newsgroup section the answer was obvious. Just go ahead and create a dummy case. A sample query that works for one of my models follows.

SELECT
    Predict([Model_NaiveBayes].[Bucket]),
    PredictProbability([Model_NaiveBayes].[Bucket])
From [Model_NaiveBayes]
NATURAL PREDICTION JOIN
SHAPE { OPENQUERY(Test, 'SELECT 1 as CaseKey') }
APPEND (
    { OPENQUERY(Test,'
    select 1 as ForeignKey, Term
    from Terms
    cross apply Matches('some long discourse containing many terms that I want to characterise'',''\b('' + Term + '')\b'')')
    } RELATE [CaseKey] TO [ForeignKey]
) AS [Msg Term Vectors]
AS T

In this case it uses the Matches TVF which I describe in a previous post to identify the terms in the text.

Wednesday, April 18, 2007

Exposing the Regular Expression Match Collection to SQL Server as a Table-Value Function

For a recent pet project I've been attempting to create a text mining model in SQL Server 2005 to analyse incoming messages and automatically bucket them into one of a number of categories. This follows straight on from the text mining example contained in the SQL Server tutorials.

With the model implemented (using Decision Trees, Naive Bayes etc) it's easy to create a singleton query by hand that has the following form:

SELECT
[Model_NB].[Bucket],
TopCount(PredictHistogram([Bucket]), $AdjustedProbability, 3)
From
[Model_NB]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'some defining term' AS [Term]
UNION SELECT 'another identifying noun or phrase' AS [Term]) AS [Msg Term Vectors]) AS t

But I still needed to extract the identifying noun phrases that make up the terms. Given a dictionary of terms and a length of freeform text how do you find all the term occurences?

Using the SQL Server string functions is painful so I thought I'd try the Match Collection object in the CLR. To expose this you need to perform the following operations.

Firstly, enable CLR integration with

EXEC sp_configure 'clr enabled' , '1'

Then create a SqlServer function in .net to expose the MatchCollection eg

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections;
public partial class UserDefinedFunctions
{
     [Microsoft.SqlServer.Server.SqlFunction (FillRowMethodName="RowFiller", TableDefinition="Match NVARCHAR(MAX)")]
     public static IEnumerable Matches(String text, String pattern)
    {
        MatchCollection mc = Regex.Matches(text, pattern);
        return mc;
    }
    // this gets called once each time the framework calls the iterator on the underlying matchcollection
    public static void RowFiller(object row, out string MatchTerm) {
        Match m1 = (Match)row;
        MatchTerm = m1.Value;
    }
};

Then deploy using Visual Studio, or if you want to do it manually try:

CREATE ASSEMBLY MatchesAssembly FROM 'c:\somewhere\some.dll' WITH PERMISSION_SET = SAFE

CREATE FUNCTION Matches(@text NVARCHAR(MAX), @term NVARCHAR(MAX))
RETURNS TABLE
(Matches NVARCHAR(MAX))
AS
EXTERNAL NAME TextTools.UserDefinedFunctions.Matches

And voila, you can do the following...

select *
from terms cross apply Matches(@text,'\b(' + terms.term + ')\b')

Where terms is a table of noun phrases you're searching for in the variable @text that contains the message text.

Tuesday, April 10, 2007

Measuring how quickly we learn from our mistakes

For 10 years I've heard how you need to learn from your mistakes and it seems quite a reasonable mantra. But the problem I've got with learning from mistakes is how quickly do you learn.

Some people and some organisations seem to learn really quickly, others, more like myself, take a while longer, others take the whole of their existence.

So the question I've got is how do measure the responsiveness of someone, some organisation, to learning from their mistakes?

What metric could we use to tune performance?

Not sure, but maybe we can think something up.

Sunday, April 08, 2007

Whales off the Beach

Wonderful sight - whales off the beach, maybe 300m from shore on a beautiful still, sunny day. Not sure what type, but I don't think they were Orcas - they didn't seem large enough. The shame of it is that if it weren't for the cold I've had for the last few days, my plan was to be out kayaking this morning. Wouldn't that have been amazing!

Oh, perhaps a bigger shame, especially from the point of view of the whales, was the idiot out there going round and round in circles on a jetski. Confirms every opinion I've ever had of them.

Update: Talking to someone this morning I found out they were indeed Orcas. Now I really wish I'd been up to going out on the kayak.

Friday, April 06, 2007

The value of a demonstrable prototype

I've just been reading an interview by Ron Jacobs with Scott Guthrie; it's up on The Microsoft Architect Journal (http://msdn2.microsoft.com/en-us/library/bb266332.aspx).

What's caught my attention is the importance Scott put on having a demonstrable prototype of their ASP.Net technology early in the development. "There were three or four of us" and yet they managed to create one of the key product offerings of one of the world's largest companies. A truly remarkable outcome.

Now I'm sure that the team itself was extraordinary, but still I bet in many companies they would have been driven into the ground under the weight of the stakeholders. The more people involved, the more diluted the good ideas become and the greater effort needs to be put into selling ideas and managing the stakeholders. The processes companies typically put into place to manage new delivery put a stranglehold on true innovation.

'...That's one of the successful things that we did with ASP.NET. We said that we're
going to throw away every line of code we're going to write for the next couple
of months. Let's all agree on that. We're not going to say, "Oh let's take this
and adapt it; we can clean it up." No. We're going to throw it away. We're going
to "deltree" this subdirectory at some point, and that way we can be more
adventurous about trying new things. We don't have to worry about making sure
that everything's robust because it's going to be in the final version.

We actually did that for a few months and said, "We're done, delete it; let's start
over from scratch; now let's write the full production code and make sure we
bake in quality at the time." I think a lot of teams could benefit from that.
The hardest thing is making sure you delete the prototype code. Too often,
projects develop with "Well, it's kind of close." It's very difficult to start
with a prototype and make it robust. I'm a firm believer in starting with a
prototype phase and then deleting it.'

The typical corporate company approach would not have sustained this type of development.

It would be an incredibly gifted individual that could convince a steering committee to support expenditure on experimentation when little or no artifacts could be easily capitalised at the end of a 3 month development. Most companies want functional delivery early with a clear line of connection from end user requirements to final deliverable.

Scott talks specifically about the value of a prototype to help sell the idea.

"We certainly had to persuade a number of people along the way. One thing we
did early in the project was get running code in prototypes that we could show
people. Often when you're working on a project that's new or something that
hasn't been done before, it's easy to put together a bunch of PowerPoint slides
that sound good, but it's especially valuable to actually show code and walk
people through code that's running. Not only does the prototype prove that it's
real, but also you just learn a terrific amount by doing it."

Perhaps their greatest achievement was the successful marketing of the development from initial concept through to the current deliverables. The early prototyping and demonstrations were obviously a critical part of this achievement.

At the end of the day all of this makes great fodder for the creation of a myth list: the My Myth List. My first two myths are going to be:

End users know what they require
Product delivery is a linear process

Because it's apparent that people don't know what they want. It's the visualisation of what is possible that provides a basis for requirements. And product delivery is never linear.

I'm sure I'll think of more over the next few weeks.

Sunday, April 01, 2007

21C in the pool on April 1st

Global warming? Well it was a late start to summer...

Commotion at the beach

One of the Kapiti island ferrys was caught out this afternoon. I joined in a small crowd to watch as the tractor that was supposed to haul it out of the water got stuck in the sea. The engine appeared to have failed and in a scene reminescent of the Little Digger, 3 tractors lined up in a row to try and tow the stranded tractor and ferry out, without success. They temporarily gave up on the ferry, hauled out the tractor before the tide came in further, then towed the ferry onto the shore to allow the passengers out. A bit of excitment while I was down at the playground with Lil!

Bohdan Szymanik