
When it's not cool to use PowerShell.

October 10, 2020 — Jesse Harris

PowerShell is an extremely productive language. Pipeline looping combined with regular expressions and custom objects lets you quickly build a one-liner, which makes it tempting to use for data processing. But when it comes to larger data sets, you're better off looking to something else.


I recently had the need to analyse a 500+ MB text file consisting of DNS logs. 500 MB isn't a great deal of data; it's hardly a candidate for a Hadoop cluster. So I naturally gravitated to the language I'm most familiar with: PowerShell.

I use PowerShell enough that I can quickly hack together a one-liner to produce the information I need within 10 to 15 minutes, so it's a natural choice for me. However, playing with a slightly larger dataset reminded me of one of PowerShell's flaws: speed.

Here is the PowerShell used to parse the file:

        # keep only the lines that start with a digit (the timestamp),
        # then split each log line into fields on runs of spaces
        Get-Content $LogFile | Where-Object {$_ -match "^\d"} | ForEach-Object {
            $split_data = $_ -split " +"
            # the first three fields are the date, time and AM/PM marker
            $date = "$($split_data[0]) $($split_data[1]) $($split_data[2])"
            [PSCustomObject]@{
                Date = $date
                Client = $split_data[8]
                Direction = $split_data[7]
                Type = $split_data[-2]
                # turn the (length)label notation into a dotted domain name
                Request = ($split_data[-1] -replace "\(\d+\)", ".").TrimStart(".")
                Response = $split_data[-3].TrimEnd(']')
            }
        }

Nothing fancy, just some regex, splitting and replacing on each line.

This command took 37 minutes to completely parse the file. My intention was to store the output in a variable and then use Group-Object in various ways to poke at the data.
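
The kind of poking I had in mind was something along these lines: a rough sketch, assuming the parsed objects from above have been captured in a $data variable, that counts queries per requested domain and lists the heavy hitters.

        # count queries per requested domain and list the heaviest hitters
        $data | Group-Object -Property Request |
            Sort-Object -Property Count -Descending |
            Select-Object -Property Count, Name -First 20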

After the dismal performance of PowerShell I thought I'd have a crack at analysing the data using awk.

AWK is a fun little language whose syntax and function list are so small it's quite easy to grasp. It doesn't have anywhere near the built-in tools of PowerShell, but there are enough little tricks you can do with it that it's fun to play with.

So I spent about an hour fiddling with awk and came up with this:

        #!/usr/bin/awk -f

        # Parsing a Windows log file: treat CRLF as the record
        # separator so the last field doesn't keep a trailing \r
        BEGIN{RS="\r\n"}

        # Convert the non-standard Windows date into a unixy one.
        # We are even calling GNU date on each line to do this,
        # which you would expect to be expensive.
        function c_date(d,t,p) {
            split(d,ds,/\//)
            split(t,ts,/:/)
            y=ds[3]
            m=ds[2]
            d=ds[1]
            if (p=="PM") {h=ts[1]+12} else {
                if (ts[1] < 10) {
                    h="0" ts[1]
                } else { h=ts[1] }
            }
            M=ts[2]
            s=ts[3]
            d=y"-"m"-"d
            t=h":"M":"s
            # build the command for GNU date
            c="date -d \"" d " " t "\" |tr -d '\r\n'"
            # read the reformatted date into the date variable
            # (|& is gawk's coprocess operator), then close the
            # pipe so we don't leak a file descriptor per line
            c |& getline date
            close(c)
        }

        /^[0-9]/ {
            # same regex replace as in the PowerShell version
            gsub(/\([0-9]+\)/,".",$NF)
            # I should have split this out into a function
            # (see the sketch after this script), but here
            # I am reversing the domain so I can sort by it.
            split($NF,dom,/\./)
            newdom=""
            for (i=length(dom)-1;i>0;i--) {
                newdom=newdom "." dom[i]
            }
            c_date($1,$2,$3)
            sub(/\]/,"",$(NF-2))
            # printing all the bits now
            print newdom,date,$8,$9,$(NF-1),$(NF-2)
        }
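
As the comment above admits, the domain reversal really belongs in its own function. A rough sketch of what that refactor could look like (the rev_domain name is just an example):

        # sketch: the same domain reversal as above, as a function;
        # the extra parameters are awk's convention for locals
        function rev_domain(name,    parts, n, i, out) {
            n = split(name, parts, /\./)
            out = ""
            for (i = n - 1; i > 0; i--) {
                out = out "." parts[i]
            }
            return out
        }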

After all that extra processing in the awk script (reversing the domain, converting the date), I then aggregate and sort the data by piping the output back into another awk invocation:

        ./parse_dns.awk dnslog.log | awk '{a[$1]++}END{for (i in a) {
            print a[i],i }}' | sort -n

This bit builds an associative array, a, keyed by domain, adding 1 to the count each time a domain is queried. Then in the END block we print the number of times each domain was queried and sort the output to find the heavy hitters.

Including the grouping and sorting pipeline that runs after the script, the total run time was 17 seconds.

Closing thoughts

If you don't use awk all the time, it can take a fair bit longer to develop the solution, but it is fun, and if you ever use that script again you will make up the lost time.

Tags: powershell, awk