Friday, June 04, 2010

IP Address to Country Map

I’m learning how to use the F# HashMultiMap and I thought up a sample exercise – do a lookup in memory of IP address to a country code. This is a task I’ve seen embedded in a relational table before but that seemed daft – it’s on disk so it will be slow. It seems smarter to do this all in memory and just serve out a result.

It’s relatively easy to get hold of IP address country/region/ISP lookups in the form of a single file; for example, one for country lookups is available under a GNU license.

First I load up the text file:

open System
open System.IO
open System.Text.RegularExpressions

// Partial active pattern: succeeds with all matches when the regex matches at least once
let (|ActiveRegex|_|) regex str =
    let ms = Regex(regex).Matches(str)
    if ms.Count > 0
    then Some ((Seq.cast ms : Match seq))
    else None

let matches s re =
    match s with
    | ActiveRegex re results -> results
    | _ -> Seq.empty

// Flatten the capture groups of every match (skipping group 0, the whole match)
let capturesSeq s p =
    seq { for m in matches s p ->
            Seq.skip 1 (seq { for g in m.Groups -> g.Value }) }
    |> Seq.concat

let csvRegex = "\"([\w\s:;'`~!@#$%\^&\*_<>,\.\\\/\|\[\]\{\}\(\)\-\+\?]*)(?:\",|\"$)"

let isInt64 (i:string) =
    let v,_ = Int64.TryParse(i)
    v

let parseIpToCountryLine lineNo (line:String) =
    try
        let values =
            capturesSeq line csvRegex
            |> Seq.toArray

        if not (isInt64 values.[0]) then failwith (sprintf "Bad IP FROM on line %i" lineNo)
        if not (isInt64 values.[1]) then failwith (sprintf "Bad IP TO on line %i" lineNo)
        // (ip from, ip to, country name)
        int64 values.[0], int64 values.[1], string values.[6]
    with
    | :? System.IndexOutOfRangeException -> failwithf "Failed on line %A, contents: %A" lineNo line

let IpToCountryLines = File.ReadAllLines(@"c:\temp\IpToCountry.csv", Text.Encoding.Default)

The file format looks like this (details are in the header comments of the file):
# "1346797568","1346801663","ripencc","20010601","il","isr","Israel"
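To make the field positions concrete, here is a small self-contained sketch that pulls the values out of that sample line. It deliberately uses a much simpler pattern than csvRegex (it assumes no field contains a double quote), and the names sampleLine and fields are only for illustration.

```fsharp
open System.Text.RegularExpressions

// Rebuild the sample line from the file: seven quoted, comma-separated fields.
let sampleLine =
    ["1346797568"; "1346801663"; "ripencc"; "20010601"; "il"; "isr"; "Israel"]
    |> List.map (sprintf "\"%s\"")
    |> String.concat ","

// Simplified pattern (an assumption, not the post's csvRegex):
// capture whatever sits between each pair of double quotes.
let fields =
    Regex.Matches(sampleLine, "\"([^\"]*)\"")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)
    |> Seq.toArray

// Fields 0 and 1 are the numeric range, field 6 is the country name.
let ipFrom, ipTo, country = int64 fields.[0], int64 fields.[1], fields.[6]
printfn "%d-%d -> %s" ipFrom ipTo country
```

Field 4 ("il") holds the two-letter country code, should you prefer that over the full name.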

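There is a tricky line in there: it contains an 'Å' character, which seems to require that I explicitly define the text encoding. I didn't see why at first, since I'm only passing the default, but File.ReadAllLines assumes UTF-8 when no encoding is given, whereas Text.Encoding.Default is the system's ANSI code page.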
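The sketch below shows why a character like 'Å' is sensitive to the encoding. It assumes the file is stored in a single-byte encoding (ISO-8859-1 is used here as a stand-in, since I have not checked the file's actual encoding), and the string "Åland" is just a hypothetical sample.

```fsharp
open System.Text

// ISO-8859-1 stands in for the file's single-byte encoding (an assumption).
let latin1 = Encoding.GetEncoding("iso-8859-1")

let original = "\u00C5land"            // "Åland"; Å is the single byte 0xC5 in ISO-8859-1
let bytes = latin1.GetBytes(original)

// Decoding those bytes as UTF-8 fails for 0xC5: it looks like the start of a
// two-byte sequence, but the next byte is not a valid continuation byte.
let asUtf8 = Encoding.UTF8.GetString(bytes)

// Decoding with the encoding the bytes were written in round-trips cleanly.
let asLatin1 = latin1.GetString(bytes)

printfn "UTF-8 round trip ok: %b" (asUtf8 = original)
printfn "ISO-8859-1 round trip ok: %b" (asLatin1 = original)
```

So passing an explicit single-byte encoding to File.ReadAllLines is what keeps such a line intact.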

let getIpToCountry (lines:string []) =
    lines
    |> Seq.filter (fun i -> not(i.StartsWith("#")))   // skip the header comments
    |> Seq.mapi parseIpToCountryLine

let Ip2Country = getIpToCountry IpToCountryLines

Then a function to create a numeric form of an IP address:

let numIP (a:int) (b:int) (c:int) (d:int) = (int64 d) + ((int64 c)*256L) + ((int64 b)*256L*256L) + ((int64 a)*256L*256L*256L)
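As a quick check of the arithmetic, the "from" value in the sample line, 1346797568, is the address 80.70.128.0. The sketch below repeats numIP so it runs on its own; numIPOfString and ipOfNum are hypothetical helpers added for illustration only.

```fsharp
// Same arithmetic as numIP above, repeated so the snippet is self-contained.
let numIP (a:int) (b:int) (c:int) (d:int) =
    (int64 d) + (int64 c)*256L + (int64 b)*256L*256L + (int64 a)*256L*256L*256L

// Hypothetical helper: parse a dotted-quad string and reuse numIP.
let numIPOfString (s:string) =
    match s.Split('.') |> Array.map int with
    | [| a; b; c; d |] -> numIP a b c d
    | _ -> failwithf "Not a dotted-quad address: %s" s

// Hypothetical inverse: turn the numeric form back into a dotted quad.
let ipOfNum (n:int64) =
    sprintf "%d.%d.%d.%d"
        ((n >>> 24) &&& 255L) ((n >>> 16) &&& 255L) ((n >>> 8) &&& 255L) (n &&& 255L)

printfn "%d" (numIPOfString "80.70.128.0")   // 1346797568
printfn "%s" (ipOfNum 1346801663L)           // 80.70.143.255
```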


Then the hash and a quick test:

#r "FSharp.PowerPack.dll"
// count down until a key is found - could be sped up by creating extra entries when to-from is a large gap
let Ip2Country3 =
    HashMultiMap<_,_>(
        Ip2Country |> Seq.map (fun (ipFrom,ipTo,country) -> ipFrom, country ),
        HashIdentity.Structural)

// note: assumes ip is not below the lowest range start in the file
let rec countryFromIP (ip:int64) =
    if Ip2Country3.ContainsKey(ip) then
        Ip2Country3.[ip]
    else
        countryFromIP (ip-1L)

// test with 10 addresses taken from our public site web logs
let testIPs = [… in the interests of confidentiality stick some of your own in here… as (a,b,c,d) tuples…]

let time f x =
    let timer = System.Diagnostics.Stopwatch.StartNew()
    try f x finally
        printf "Execution duration: %gms\n" timer.Elapsed.TotalMilliseconds;;

testIPs |> List.map (fun (a,b,c,d) -> numIP a b c d) |> time (List.map countryFromIP)

I’ve found it takes about 1 millisecond for 10 random lookups on my virtual workstation instance (running under Hyper-V with 3GB allocated on a desktop Dell Optiplex 760).
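Counting down one address at a time can take many steps when a range is wide. One possible alternative, sketched here and not what the code above does, is to keep the ranges in an array sorted by start address and binary-search for the last range that starts at or before the IP. The ranges array, the "Hypothetical A" entry and tryCountryFromIP are all made up for illustration; only the Israel range comes from the sample line.

```fsharp
// (ipFrom, ipTo, country) triples; the first entry is invented sample data.
let ranges =
    [| (16777216L, 16777471L, "Hypothetical A");
       (1346797568L, 1346801663L, "Israel") |]

// Sort once by range start so lookups can binary-search.
let sorted = ranges |> Array.sortBy (fun (ipFrom, _, _) -> ipFrom)

let tryCountryFromIP (ip:int64) =
    // Find the index of the last range whose start is <= ip, if any.
    let rec go lo hi best =
        if lo > hi then best
        else
            let mid = lo + (hi - lo) / 2
            let (ipFrom, _, _) = sorted.[mid]
            if ipFrom <= ip then go (mid + 1) hi (Some mid)
            else go lo (mid - 1) best
    match go 0 (sorted.Length - 1) None with
    | Some i ->
        let (_, ipTo, country) = sorted.[i]
        // Unlike the count-down version, reject addresses past the range end.
        if ip <= ipTo then Some country else None
    | None -> None

printfn "%A" (tryCountryFromIP 1346800000L)   // Some "Israel"
printfn "%A" (tryCountryFromIP 1346801664L)   // None
```

Each lookup then costs a number of steps that grows with the logarithm of the number of ranges, and an address that falls in a gap between ranges returns None instead of borrowing the previous range's country.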