For a recent pet project I've been attempting to create a text mining model in SQL Server 2005 to analyse incoming messages and automatically bucket them into one of a number of categories. This follows straight on from the text mining example contained in the SQL Server tutorials.
With the model implemented (using Decision Trees, Naive Bayes etc) it's easy to create a singleton query by hand that has the following form:
SELECT
[Model_NB].[Bucket],
TopCount(PredictHistogram([Bucket]), $AdjustedProbability, 3)
From
[Model_NB]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'some defining term' AS [Term]
UNION SELECT 'another identifying noun or phrase' AS [Term]) AS [Msg Term Vectors]) AS t
But I still needed to extract the identifying noun phrases that make up the terms. Given a dictionary of terms and a length of freeform text how do you find all the term occurences?
Using the SQL Server string functions is painful so I thought I'd try the Match Collection object in the CLR. To expose this you need to perform the following operations.
Firstly, enable CLR integration with
EXEC sp_configure 'clr enabled' , '1'
Then create a SqlServer function in .net to expose the MatchCollection eg
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction (FillRowMethodName="RowFiller", TableDefinition="Match NVARCHAR(MAX)")]
public static IEnumerable Matches(String text, String pattern)
{
MatchCollection mc = Regex.Matches(text, pattern);
return mc;
}
// this gets called once each time the framework calls the iterator on the underlying matchcollection
public static void RowFiller(object row, out string MatchTerm) {
Match m1 = (Match)row;
MatchTerm = m1.Value;
}
};
Then deploy using Visual Studio, or if you want to do it manually try:
CREATE ASSEMBLY MatchesAssembly FROM 'c:\somewhere\some.dll' WITH PERMISSION_SET = SAFE
CREATE FUNCTION Matches(@text NVARCHAR(MAX), @term NVARCHAR(MAX))
RETURNS TABLE
(Matches NVARCHAR(MAX))
AS
EXTERNAL NAME TextTools.UserDefinedFunctions.Matches
And voila, you can do the following...
select *
from terms cross apply Matches(@text,'\b(' + terms.term + ')\b')
Where terms is a table of noun phrases you're searching for in the variable @text that contains the message text.
No comments:
Post a Comment