Friday, March 10, 2017

Parse an HTML page

A colleague of mine had to give a demo on log file analysis and asked for my help parsing an HTML page.  The page shows Belgian air pollution data (http://deus.irceline.be/~celinair/no2/no2.php?lan=nl#tabel).  He wanted a script that would create a log file from this data.  Nothing fancy: just some regex manipulation, some looping, and you're all set.

Here is the code:


# fetch the raw HTML; the URI is quoted so "?" and "#" are passed through literally
$polutionTable = Invoke-RestMethod -Uri "http://deus.irceline.be/~celinair/no2/no2.php?lan=nl#tabel" -Method Get

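# remove every pattern in the comma-separated list $c from the text $t
# (-replace treats each item as a case-insensitive regex)
function remove($t,$c){
    $list = $c -split ","
    foreach($l in $list){
        $t = $t -replace $l,""
    }
    return $t
}
# strip the opening and closing tags named in $c, keeping the text between them
# note: opening tags are matched by prefix, so "b" also strips <body> and <br>
function removetag($t,$c){
    $list = $c -split ","
    foreach($l in $list){
        $t = $t -replace "<$l[^>]*>",""
        $t = $t -replace "<\/$l>",""
    }
    return $t
}
# strip double-quoted attributes named in $c, e.g. align="left"
function removeattr($t,$c){
    $list = $c -split ","
    foreach($l in $list){
        $t = $t -replace "\s$l=`"[^`"]*`"",""
    }
    return $t
}
# strip HTML comments (line breaks must already be removed, since "." does not match them)
function removecomment($t){
    $t = $t -replace "<\!--(?:(?!-->).)*-->",""
    return $t
}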

# remove line breaks & space entities
$polutionTable = remove $polutionTable "`n,`r,&nbsp;,&nbsp"

# cleanup html
$polutionTable = removeattr $polutionTable "valign,align,bgcolor,border"
$polutionTable = removetag $polutionTable "title,strong,head,i,ul,li,script,img,map,hr,small,font,p,b,a,CENTER"
$polutionTable = removecomment $polutionTable

# put each table row back on its own line
$polutionTable = $polutionTable -replace "<TR>","`n<TR>"
$polutionTable = $polutionTable -replace "</TR>","</TR>`n"

$list = @()

foreach($line in ($polutionTable -split "`n")){
    if($line -match "^<TR><TD>(.*)<\/TD><TD>(.*)<\/TD><TD>(.*)<\/TD><TD>(.*)<\/TD><TD>(.*)<\/TD><TD>(.*)<\/TD><TD>(.*)<\/TD><\/TR>$"){
        $o = @{
            code = $Matches[1];
            stad = $Matches[2].trim().PadRight(30);
            prov = $Matches[3].trim().PadRight(10);
            cur = $Matches[4].trim().PadLeft(10,"0");
            top = $Matches[5].trim().PadLeft(10,"0");
            from = $Matches[6].trim().PadRight(5);
            to = $Matches[7].trim().PadRight(5);
        }
        $list += (New-Object -TypeName PSObject -Property $o)
    }
}

$timestamp = $(get-date -f MM-dd-yyyy_HH_mm_ss)
$date = $(get-date -f MM-dd-yyyy)
$filename = ".\polution_$timestamp.log"
foreach($l in $list){
    "$date $($l.from) $($l.to) $($l.prov) $($l.stad) $($l.cur) $($l.top)" | Out-File $filename -Append
}
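
To try the cleanup and the row regex without hitting the live page, the same steps can be run on a made-up snippet. This is only a minimal sketch: the sample HTML and its values (station code, city, readings) are hypothetical and merely mimic the seven-column rows the script expects, and the helper functions are inlined as plain -replace calls so the block runs on its own.

```powershell
# Hypothetical sample row; the real page layout and values may differ.
$sample = '<TABLE border="1"><TR><TD align="left">41B004</TD><TD><b>Brussel</b></TD>' +
          '<TD>BRU</TD><TD>35</TD><TD>60</TD><TD>08:00</TD><TD>09:00</TD></TR></TABLE>'

# Same idea as removeattr and removetag above, inlined for this example.
$clean = $sample -replace '\s(align|border)="[^"]*"', ''
$clean = $clean -replace '</?b\b[^>]*>', ''

# Put each table row on its own line, as the script does.
$clean = $clean -replace '<TR>', "`n<TR>" -replace '</TR>', "</TR>`n"

# Build the seven-cell row pattern by repeating one cell pattern.
$cell = '<TD>(.*?)</TD>'
$rowPattern = "^<TR>$($cell * 7)</TR>$"

$result = foreach ($line in ($clean -split "`n")) {
    if ($line -match $rowPattern) {
        $station = $Matches[2].Trim()
        $current = $Matches[4].Trim().PadLeft(10, '0')
        "$station $current"
    }
}
$result   # prints: Brussel 0000000035
```

Only the row line matches the pattern; the lines holding the bare TABLE tags are skipped, which is why the script can loop over every line without filtering first.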
