Writing Regular Expressions in #PowerShell like a Pro
Published on 15 Feb 2018Tags #PowerShell #RegEx
Regular expressions are often considered the holy grail of parsing data. Regexes are very powerful but most of them are unreadable as well as seldomly documented. But with great power comes great responsibility. I will demonstrate how to write complex regular expressions, make them readable and even include proper documentation.
Typical RexEx
I recently had to parse the access log of web server. Web servers offer different format for this access log. The common log format produced lines like the following:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
You may be tempted to use whitespaces as the delimiter but this will fail because the timestamp used a whitespace to separate the date/time and the timezone. A quick search on the web produces regular expressions similar to the following:
/^(.+)\s(.+)\s(.+)\s\[(.+)\]\s\"(.+)\s(.+)\s(.+)\"\s(.+)\s(.+)$/
… or …
/^(\S+)\s(\S+)\s(\w+)\s\[([^\]])\]\s\"(\S+)\s(\S+)\s([^\]]+)\"\s(\d+)\s(\d+|-)$/
All of them are very hard to read.
RexExes like a Pro
In PowerShell you will usually use a construct similar to the following:
if ($Data -match $Pattern) {
$Matches
}
The following pattern parses the common log format, is easier to read and includes documentation:
$Pattern = '(?x)
^ # Beginning of the line
(?<SourceIp>\S+) # IP address
\s # field separator
\S+ # identd username (deprecated)
\s # field separator
(?<User>\S+) # username provided by HTTP auth
\s # field separator
\[(?<Timestamp>[^]]+)\] # date enclosed in brackets
\s # field separator
(?<Request>".+") # request enclosed in quotation marks
\s # field separator
(?<Code>\d+) # HTTP return code
\s # field separator
(?<Size>\d+|-) # Size of response
$ # End of the line
'
At the beginning, the pattern enables extended mode by specifying (?x)
. It forces the parser to ignore whitespaces (space, tab and newlines) and enables comments using #
. Each important item is assigned a name by using (?<Name>*subexpression*)
. PowerShell make all matches available in a hashtable called $Matches
but - instead of assigning an index to every expression enclosed in brackets - those matches are assigned the specified name:
PS> $Matches
Name Value
---- -----
Size 2326
User frank
SourceIp 127.0.0.1
Timestamp 10/Oct/2000:13:55:36 -0700
Code 200
Request "GET /apache_pb.gif HTTP/1.0"
0 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
As a side effect, this also makes your code more readable because it becomes obvious which part of the expression you are refencing:
PS> [datetime]::ParseExact(
$Matches['Timestamp'],
'dd/MMM/yyyy:HH:mm:ss zz00',
[System.Globalization.CultureInfo]::InvariantCulture
).ToString([Globalization.CultureInfo]'en-US')
10/10/2000 10:55:36 PM
Happy regex’ing!