Friday, June 04, 2010

IP Address to Country Map

I’m learning how to use the F# HashMultiMap and I thought up a sample exercise – do a lookup in memory of IP address to a country code. This is a task I’ve seen embedded in a relational table before but that seemed daft – it’s on disk so it will be slow. It seems smarter to do this all in memory and just serve out a result.

It’s relatively easy to get hold of IP address country/region/ISP lookups in the form of a single file; for example, http://software77.net/geo-ip/ provides one for country lookups under a GNU license.

First I load up the text file:

open System
open System.IO
open System.Text.RegularExpressions

let (|ActiveRegex|_|) regex str =
    let ms = Regex(regex).Matches(str)
    if ms.Count > 0
    then Some ((Seq.cast ms : Match seq))
    else None

let matches s re =
    match s with
    | ActiveRegex re results -> results
    | _ -> Seq.empty

let capturesSeq s p =
    seq {
        for m in matches s p ->
            // skip group 0 (the whole match) and keep only the captured groups
            Seq.skip 1 (seq { for g in m.Groups -> g.Value })
    }
    |> Seq.concat


let csvRegex = "\"([\w\s:;'`~!@#$%\^&\*_<>,\.\\\/\|\[\]\{\}\(\)\-\+\?]*)(?:\",|\"$)"

let isInt64 i =
    let v,_ = Int64.TryParse(i)
    v

let parseIpToCountryLine lineNo (line:String) =
    try
        let values =
            capturesSeq line csvRegex
            |> Seq.toArray

        // fail early if either end of the range is not numeric
        if not (isInt64 values.[0]) then failwithf "Bad IP FROM on line %i" lineNo
        if not (isInt64 values.[1]) then failwithf "Bad IP TO on line %i" lineNo
        // columns 0 and 1 are the numeric range, column 6 is the country name
        int64 values.[0], int64 values.[1], string values.[6]
    with
    | :? System.IndexOutOfRangeException -> failwithf "Failed on line %A, contents: %A" lineNo line

let IpToCountryLines = File.ReadAllLines(@"c:\temp\IpToCountry.csv", Text.Encoding.Default)

(*
File format is like this (details are in the header comments of the file):
# IP FROM IP TO REGISTRY ASSIGNED CTRY CNTRY COUNTRY
# "1346797568","1346801663","ripencc","20010601","il","isr","Israel"

There is a tricky line in there containing an 'Å' character, which is why I pass the text encoding explicitly. When no encoding is given, File.ReadAllLines reads the file as UTF-8 rather than Encoding.Default, which would explain why the explicit argument is needed.
*)

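let getIpToCountry (lines:string []) =
    lines
    |> Seq.filter (fun i -> not(i.StartsWith("#")))
    // note: Seq.mapi numbers the data lines only, so reported line numbers exclude the # comment lines
    |> Seq.mapi parseIpToCountryLine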

let Ip2Country = getIpToCountry IpToCountryLines
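
To see what the parsing step produces, here is a minimal sketch that pulls the fields out of the sample line from the file-format comment. It uses a simplified pattern (anything between double quotes) rather than csvRegex, and the names quotedField and fieldsOf are made up for this illustration.

```fsharp
open System.Text.RegularExpressions

// Simplified stand-in for csvRegex (an assumption, not the pattern used above):
// capture whatever sits between each pair of double quotes.
let quotedField = Regex("\"([^\"]*)\"")

// Hypothetical helper: return the quoted fields of one CSV line as an array.
let fieldsOf (line: string) =
    quotedField.Matches(line)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)
    |> Seq.toArray

let sample = "\"1346797568\",\"1346801663\",\"ripencc\",\"20010601\",\"il\",\"isr\",\"Israel\""
let fields = fieldsOf sample

// Indexes 0, 1 and 6 are the ones parseIpToCountryLine keeps.
printfn "%d fields: from=%s to=%s country=%s" fields.Length fields.[0] fields.[1] fields.[6]
```

For the sample line this gives seven fields, with the range bounds in positions 0 and 1 and the country name in position 6.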

Then a function to create a numeric form of an IP address:

let numIP (a:int) (b:int) (c:int) (d:int) = (int64 d) + ((int64 c)*256L) + ((int64 b)*256L*256L) + ((int64 a)*256L*256L*256L)
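
As a quick check on the arithmetic, the range in the sample line from the file-format comment works out to 80.70.128.0 through 80.70.143.255. The sketch below repeats numIP so it runs on its own; octets is a hypothetical inverse I've added for illustration and is not used elsewhere.

```fsharp
// Same formula as numIP above, repeated so this snippet runs on its own.
let numIP (a:int) (b:int) (c:int) (d:int) =
    (int64 d) + ((int64 c)*256L) + ((int64 b)*256L*256L) + ((int64 a)*256L*256L*256L)

// Hypothetical inverse (not in the original code): split the number back into four octets.
let octets (n:int64) =
    int (n / 16777216L % 256L), int (n / 65536L % 256L), int (n / 256L % 256L), int (n % 256L)

printfn "%d" (numIP 80 70 128 0)     // 1346797568, the sample IP FROM
printfn "%A" (octets 1346801663L)    // (80, 70, 143, 255), the sample IP TO
```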

Then the hash and a quick test:

#r "FSharp.PowerPack.dll"
// lookups count down from the address until a key is found; this could be sped up by creating extra entries where the gap between from and to is large
// each range is keyed by its first address (ipFrom); ipTo is dropped
let Ip2Country3 =
    HashMultiMap<_,_>(
        Ip2Country |> Seq.map (fun (ipFrom,ipTo,country) -> ipFrom, country),
        HashIdentity.Structural)

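// note: ipTo is never checked, so an address in an unassigned gap reports the range below it
let rec countryFromIP (ip:int64) =
    if ip < 0L then
        // guard so an address below the first range cannot count down forever
        failwith "No range starts at or below this address"
    elif Ip2Country3.ContainsKey(ip) then
        Ip2Country3.[ip]
    else
        countryFromIP (ip-1L)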


// test with 10 addresses taken from our public site web logs
let testIPs = [… in the interests of confidentiality stick some of your own in here… as (a,b,c,d) tuples…]

let time f x =
    let timer = System.Diagnostics.Stopwatch.StartNew()
    try
        f x
    finally
        printf "Execution duration: %gms\n" timer.Elapsed.TotalMilliseconds;;

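// List.map and its function are parenthesised so that time wraps the lookups themselves, not just the partial application
testIPs |> List.map (fun (a,b,c,d) -> numIP a b c d) |> time (List.map countryFromIP)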

I’ve found it takes about 1 millisecond for 10 random lookups on my virtual workstation instance (running under Hyper-V with 3GB allocated on a desktop Dell Optiplex 760).
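
The count-down lookup can take many steps when an address sits far above the start of its range. One alternative, sketched here rather than measured, is to keep the ranges in a sorted array and binary search for the last ipFrom at or below the address, then check ipTo. The two ranges and country names below are made up, and tryCountryFromIP and describe are hypothetical names.

```fsharp
// Made-up sample data in the (ipFrom, ipTo, country) shape produced by getIpToCountry.
let ranges =
    [| (300L, 399L, "Bland"); (100L, 199L, "Aland") |]
    |> Array.sortBy (fun (ipFrom, _, _) -> ipFrom)

// Binary search for the last range whose ipFrom <= ip, then confirm ip <= ipTo.
let tryCountryFromIP (ip: int64) =
    let rec search lo hi best =
        if lo > hi then best
        else
            let mid = lo + (hi - lo) / 2
            let (ipFrom, _, _) = ranges.[mid]
            if ipFrom <= ip then search (mid + 1) hi (Some mid)
            else search lo (mid - 1) best
    match search 0 (ranges.Length - 1) None with
    | Some i ->
        let (_, ipTo, country) = ranges.[i]
        if ip <= ipTo then Some country else None
    | None -> None

// Hypothetical helper to print the outcome of one lookup.
let describe (ip: int64) =
    match tryCountryFromIP ip with
    | Some c -> printfn "%d -> %s" ip c
    | None -> printfn "%d -> no matching range" ip

describe 150L   // inside the first range
describe 250L   // in the gap between the two ranges
describe 50L    // below the first range
```

Returning an option means addresses in unassigned gaps, or below the first range, come back as None instead of being attributed to a neighbouring range.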

1 comment:

Markus Lindqvist said...

Hi there!
I just found out that you had implemented what I was working on a couple of days ago.

https://gist.github.com/1078445

My first implementation wasn't too fast so the Gist page includes a link to another one which splits the addresses in smaller arrays to speed up the lookup.