Monday, January 18, 2010

PowerShell Data Munging

There’s a data munging exercise on Rosetta Code which was missing a PowerShell solution. Since I spend a lot of time in my day job importing data to graph and present, often using PowerShell, I thought I’d add a couple of PowerShell options for importing the data in the Data Munging 2 problem.

First, using iteration alone to pick out the good values:

$dateHash = @{}
$goodLineCount = 0
Get-Content c:\temp\readings.txt | ForEach-Object {
    # split off the date stamp; [char[]] keeps space and tab as separate delimiters
    $line = $_.Split([char[]]" `t", 2)
    if ($dateHash.ContainsKey($line[0])) {
        $line[0] + " is duplicated"
    } else {
        $dateHash.Add($line[0], $line[1])
    }
    # split up the 24 value/flag pairs and check that every flag is >= 1
    $readings = $line[1].Split()
    $goodLine = $true
    if ($readings.Count -ne 48) { $goodLine = $false; "incorrect line length : $($line[0])" }
    for ($i = 0; $i -lt $readings.Count; $i++) {
        # the flags sit at the odd indexes
        if ($i % 2 -ne 0) {
            if ([int]$readings[$i] -lt 1) {
                $goodLine = $false
            }
        }
    }
    if ($goodLine) { $goodLineCount++ }
}
$goodLineCount
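
To make the layout concrete, here is a minimal sketch that runs the same flag check against a single made-up line instead of the real file. The sample date and values are invented for illustration, and I've used the -split operator and a loop that steps by two rather than the modulo test above.

```powershell
# Synthetic line (made up for illustration): a date, then 24 value/flag pairs.
$pairs = 1..24 | ForEach-Object { "10.000`t1" }
$sampleLine = "1991-03-30`t" + ($pairs -join "`t")

# Split off the date, then split the remainder into its 48 fields.
$date, $rest = $sampleLine -split '\s+', 2
$fields = $rest -split '\s+'

# Flags sit at the odd indexes 1, 3, 5, ... so start at 1 and step by two.
$goodLine = ($fields.Count -eq 48)
for ($i = 1; $i -lt $fields.Count; $i += 2) {
    if ([int]$fields[$i] -lt 1) { $goodLine = $false }
}
"$date : $($fields.Count) fields, good = $goodLine"
```

Changing any one of the flags in the sample to 0 or a negative number should flip the result to not good.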




And secondly, taking advantage of regular expressions:



$dateHash = @{}
$goodLineCount = 0
# one pattern per value/flag pair; the named group captures the flag
$pairPattern = '\d+\.\d+\s(?<flag>-?\d)'
ForEach ($rawLine in (Get-Content c:\temp\readings.txt)) {
    # split off the date stamp; [char[]] keeps space and tab as separate delimiters
    $line = $rawLine.Split([char[]]" `t", 2)
    if ($dateHash.ContainsKey($line[0])) {
        $line[0] + " is duplicated"
    } else {
        $dateHash.Add($line[0], $line[1])
    }
    $readings = [regex]::Matches($line[1], $pairPattern)
    if ($readings.Count -ne 24) { "incorrect number of readings for date " + $line[0] }
    $goodLine = $true
    # reuse the matches found above rather than scanning the line a second time
    foreach ($flagMatch in $readings) {
        if ([int]$flagMatch.Groups["flag"].Value -lt 1) {
            $goodLine = $false
        }
    }
    if ($goodLine) { $goodLineCount++ }
}
"$goodLineCount good lines"
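
As a small sketch of what the named group is doing, the snippet below pulls the flags out of a short made-up fragment with three value/flag pairs. The sample text and the variable names are only for illustration, and the pattern is slightly looser than the one above in that it allows multi-digit flags and runs of whitespace.

```powershell
# Made-up fragment: three value/flag pairs, the middle one flagged bad.
$sample = "10.000`t1`t0.000`t-2`t12.500`t1"
$pattern = '\d+\.\d+\s+(?<flag>-?\d+)'

# Collect the captured flag from each match as an integer.
$flags = foreach ($m in [regex]::Matches($sample, $pattern)) {
    [int]$m.Groups['flag'].Value
}

# Count the flags that fail the "greater than or equal to 1" test.
$badCount = @($flags | Where-Object { $_ -lt 1 }).Count
"flags: " + ($flags -join ', ') + " ; bad: $badCount"
```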




The F# solution on the site uses Seq.forall. I thought this would be useful to implement for the PowerShell solution as well, but I couldn't figure it out. Fortunately the good folks at Stack Overflow helped out on that...
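
For reference, here is one way a forall-style helper might look in PowerShell. This is my own minimal sketch and not necessarily what the Stack Overflow answer proposed: Test-All is a hypothetical name, not a built-in cmdlet, and unlike a lazy sequence function it reads the whole pipeline rather than stopping at the first failure.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns $true only when the
# predicate holds for every pipeline item, and $true when there is no input.
function Test-All {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $Predicate,
        [Parameter(ValueFromPipeline = $true)] $InputObject
    )
    begin { $allPass = $true }
    process {
        # ForEach-Object runs the predicate with $_ bound to the current item.
        if (-not ($InputObject | ForEach-Object $Predicate)) { $allPass = $false }
    }
    end { $allPass }
}

# Usage: are all of these (made-up) flags at least 1?
$goodFlags = 1, 1, 2, 1
$mixedFlags = 1, 1, -2, 1
$goodResult = $goodFlags | Test-All { $_ -ge 1 }
$mixedResult = $mixedFlags | Test-All { $_ -ge 1 }
"all good: $goodResult ; mixed: $mixedResult"
```

With something like this in place, the inner flag loop in either script above could collapse to a single pipeline expression per line.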