Wednesday, April 18, 2007

Exposing the Regular Expression Match Collection to SQL Server as a Table-Value Function

For a recent pet project I've been attempting to create a text mining model in SQL Server 2005 to analyse incoming messages and automatically bucket them into one of a number of categories. This follows straight on from the text mining example contained in the SQL Server tutorials.

With the model implemented (using Decision Trees, Naive Bayes etc) it's easy to create a singleton query by hand that has the following form:

 

SELECT
[Model_NB].[Bucket],
TopCount(PredictHistogram([Bucket]), $AdjustedProbability, 3)
From
[Model_NB]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'some defining term' AS [Term]
UNION SELECT 'another identifying noun or phrase' AS [Term]) AS [Msg Term Vectors]) AS t

But I still needed to extract the identifying noun phrases that make up the terms. Given a dictionary of terms and a length of freeform text how do you find all the term occurences?

Using the SQL Server string functions is painful so I thought I'd try the Match Collection object in the CLR. To expose this you need to perform the following operations.

Firstly, enable CLR integration with

EXEC sp_configure 'clr enabled' , '1'

Then create a SqlServer function in .net to expose the MatchCollection eg

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections;
public partial class UserDefinedFunctions
{
     [Microsoft.SqlServer.Server.SqlFunction (FillRowMethodName="RowFiller", TableDefinition="Match NVARCHAR(MAX)")]
     public static IEnumerable Matches(String text, String pattern)
    {
        MatchCollection mc = Regex.Matches(text, pattern);
        return mc;
    }
    // this gets called once each time the framework calls the iterator on the underlying matchcollection
    public static void RowFiller(object row, out string MatchTerm) {
        Match m1 = (Match)row;
        MatchTerm = m1.Value;
    }
};

Then deploy using Visual Studio, or if you want to do it manually try:

CREATE ASSEMBLY MatchesAssembly FROM 'c:\somewhere\some.dll' WITH PERMISSION_SET = SAFE

CREATE FUNCTION Matches(@text NVARCHAR(MAX), @term NVARCHAR(MAX))
RETURNS TABLE
(Matches NVARCHAR(MAX))
AS
EXTERNAL NAME TextTools.UserDefinedFunctions.Matches

And voila, you can do the following...

select *

from terms cross apply Matches(@text,'\b(' + terms.term + ')\b')

Where terms is a table of noun phrases you're searching for in the variable @text that contains the message text.

No comments: