Basics of Regular Expressions and PowerShell
While the real magic of PowerShell is the ability to work with streams of objects instead of just text, it is still quite good at text processing. If you do much text processing at all, you will know one of the key tools in your toolbox is the good old-fashioned regular expression.
In my day-to-day work I find myself using regular expressions in PowerShell constantly, so I thought I would write up a bit of a primer on how it works.
.NET support for regular expressions is excellent, which of course translates to good support in PowerShell. In fact there are some nice convenience features in PowerShell that make using regexes a very natural part of the command-line experience. We will cover those features shortly.
I’m not going to go into much detail about how regular expressions themselves work. If you don’t know regex, or you struggle with it, I highly recommend you read what I consider THE book on the topic, Mastering Regular Expressions by Jeffrey Friedl.
This book changed my life back in the day when I was struggling to understand just how to make use of regex. Read the first four chapters and you will have a solid grasp on how to use this powerful tool.
Regular Expressions in .NET
Support for regex in .NET comes from the System.Text.RegularExpressions.Regex type, which you can use directly in PowerShell if desired. Here is a trivial example that will match any string that consists of exactly one character:
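A minimal sketch of what that could look like, assuming we build the instance with New-Object (the variable name $regex is just for illustration):

```powershell
# Construct a regex instance using the fully-qualified .NET type name.
$regex = New-Object System.Text.RegularExpressions.Regex '^.$'

$regex.IsMatch('a')    # True: exactly one character
$regex.IsMatch('ab')   # False: two characters
```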
You don’t actually need to type out the entire fully-qualified type name; [regex] can be used to access the type as well, as in:
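For instance, casting a pattern string to [regex] gives the same kind of object. This is only a sketch; the variable name is my own:

```powershell
# [regex] is a built-in shorthand for System.Text.RegularExpressions.Regex.
$regex = [regex]'^.$'

$regex.IsMatch('a')          # True
$regex.GetType().FullName    # System.Text.RegularExpressions.Regex
```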
The simplest way to use a regular expression instance is to call IsMatch, as shown above, to see if the regex matched on the input. This can be used in a Where-Object pipeline to filter a sequence of text values:
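A small sketch of that kind of filter, using a handful of made-up input strings:

```powershell
$oneChar = [regex]'^.$'

# Keep only the strings that are exactly one character long.
$result = 'a', 'bb', 'c', 'dd' | Where-Object { $oneChar.IsMatch($_) }

$result   # a and c
```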
If you need to do something more sophisticated, the Match method can be used to get back a Match object. In this example we will match on a string that consists of exactly three comma-separated digits, and retrieve the second digit.
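Something along these lines would do it. The pattern below is one way to express "three comma-separated digits", and the input string is just an illustration:

```powershell
# Three single digits separated by commas, each in its own capture group.
$regex = [regex]'^(\d),(\d),(\d)$'

$match = $regex.Match('1,2,3')

$match.Success           # True
$match.Groups[2].Value   # 2 (Groups[0] is the overall match)
```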
Working with a regex object directly in this way gives you, of course, all of the power and flexibility provided by the .NET regular expression API. For many common text processing scenarios, though, we don’t need to deal with the complexity of directly manipulating Match objects and their properties. PowerShell provides some nice built-in operators to make it much simpler to do our regex processing.
The match operator
The primary way to use regex through PowerShell is via the -match operator. The -match operator takes a regular expression pattern, such as our ^.$ example, and applies it to some text, returning true if there is a match and false if there is not. As a side effect, it also automatically populates a special variable called $matches, which can be used to access the values returned by the overall match and any matching groups.
Here is a simple example, re-using our “3 digits separated by commas” regex from earlier:
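A sketch of such a pipeline; the five input strings are made up for illustration, and the pattern is one way to write the "three digits separated by commas" regex:

```powershell
$threeDigits = '^(\d),(\d),(\d)$'

# Five candidate strings; only the second one has the shape digit,digit,digit.
$result = 'abc', '1,2,3', '12,3,4', 'a,b,c', '1,2' |
    Where-Object { $_ -match $threeDigits }

$result   # 1,2,3
```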
Here we passed a series of five strings through the pipeline and used Where-Object to filter for anything matching our regex. Only the second string matches the pattern we care about, so that is the single item that comes out of the pipeline. Since our regular expression used parentheses to define three separate capture groups, we can now access, for instance, the value in the second group through the aforementioned $matches object:
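As a sketch, re-running the same kind of filter over made-up input and then inspecting $matches:

```powershell
$threeDigits = '^(\d),(\d),(\d)$'

# Run the filter; we only care about the side effect on $matches here.
$null = 'abc', '1,2,3', '12,3,4', 'a,b,c', '1,2' |
    Where-Object { $_ -match $threeDigits }

$matches[0]   # 1,2,3 (the overall match)
$matches[2]   # 2     (the second capture group)
```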
Note that the match at index zero will be the overall match, so the first capture group starts at index 1, and so on.
In the case of streaming data through the PowerShell pipeline, $matches will contain the result of the most recent match; i.e., the last item in the pipeline that matched the regular expression.
The $matches object, incidentally, is simply a System.Collections.Hashtable mapping integer keys (the match index) to string values (the matched text).
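You can check this for yourself with a quick sketch like the following (the input string is arbitrary):

```powershell
# Populate $matches with a single scalar match.
$null = '1,2,3' -match '^(\d),(\d),(\d)$'

$matches.GetType().FullName    # System.Collections.Hashtable
$matches.Keys | Sort-Object    # 0, 1, 2, 3
$matches[3]                    # 3
```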
The notmatch operator
There is a complementary operator to -match called -notmatch, which does pretty much what you would expect:
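A sketch with made-up input, in which two of the five strings (“0,1,2” and “9,9,9”) fit the three-digit pattern and are therefore filtered out:

```powershell
$threeDigits = '^(\d),(\d),(\d)$'

# Keep only the strings that do NOT look like digit,digit,digit.
$result = 'abc', '0,1,2', 'x,y,z', '9,9,9', '12,34' |
    Where-Object { $_ -notmatch $threeDigits }

$result   # abc, then x,y,z, then 12,34
```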
What might not be obvious is that the $matches variable will still be populated by the -notmatch operator: whenever the pattern matches an input, $matches is set, even though -notmatch returns false for that input. In this example there were two matching strings, “0,1,2” and “9,9,9”, so we would expect $matches to contain the latter, which it does:
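A sketch that demonstrates this, again with made-up input containing “0,1,2” and “9,9,9”:

```powershell
$threeDigits = '^(\d),(\d),(\d)$'

# -notmatch drops the two matching strings, but each match still sets $matches.
$null = 'abc', '0,1,2', 'x,y,z', '9,9,9', '12,34' |
    Where-Object { $_ -notmatch $threeDigits }

$matches[0]   # 9,9,9 (the most recent string that matched the pattern)
```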
A real-world example
Here is an example of a real-world text processing problem that came up at work recently.
We have a large blob of text that looks something like this:
This text is the output of some diagnostic command, and we need to extract the unique set of GUIDs from it. I know plenty of people whose first instinct would be to dump this data into Excel and use that to do the filtering, but PowerShell can make quick work of this particular problem, using the techniques we have been discussing.
First we need a regex to match the data we are interested in; in this case, a good starting point would be to have a regex that will match a GUID. It’s clear that each GUID has a very specific format:
We have groups of hexadecimal digits separated by dashes, in the pattern:
8-4-4-4-12
(Where each number is the count of consecutive hex digits.)
This can be represented quite exactly with a regular expression like so:
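One way to write it is sketched below. The GUID in the check is a made-up sample, and because -match is case-insensitive by default the character class only needs the lowercase letters:

```powershell
# 8-4-4-4-12 hexadecimal digits, separated by dashes.
$guidRegex = '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'

'3F2504E0-4F89-11D3-9A0C-0305E82C3301' -match $guidRegex   # True
'not-a-guid' -match $guidRegex                             # False
```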
It’s also possible to use the nice string interpolation feature of PowerShell to compose the regex in a way that makes it clear we are re-using the same set of characters in a specific pattern. For instance, it might be easier to comprehend if we wrote the above as:
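A sketch of the interpolated form; the variable names are my own, and I write ${hex} with braces so that the following {8} is unambiguously treated as literal regex text:

```powershell
# The character class for a single hexadecimal digit.
$hex = '[0-9a-f]'

# Compose the 8-4-4-4-12 pattern from the shared character class.
$guidRegex = "${hex}{8}-${hex}{4}-${hex}{4}-${hex}{4}-${hex}{12}"

$guidRegex   # [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
```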
These two approaches are entirely equivalent; which you prefer is a matter of taste.
In any event, given a regex to match a GUID, we can then compose that into a regex that matches two GUIDs separated by a comma and a space, which is the pattern in our blob of text that we care about. Again, using string interpolation:
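Here is a sketch of how the whole thing might fit together. The $lines array is a hypothetical stand-in for the diagnostic output, and using Sort-Object -Unique to de-duplicate is my assumption about one reasonable way to get the unique set:

```powershell
$hex  = '[0-9a-f]'
$guid = "${hex}{8}-${hex}{4}-${hex}{4}-${hex}{4}-${hex}{12}"

# Two GUIDs separated by a comma and a space, each in its own capture group.
$pair = "($guid), ($guid)"

# Hypothetical sample lines, made up for illustration.
$lines = @(
    'link 3f2504e0-4f89-11d3-9a0c-0305e82c3301, 9b2d1f6a-0c3e-4d5f-8a7b-1c2d3e4f5a6b ok'
    'header line with no GUIDs'
    'link 9b2d1f6a-0c3e-4d5f-8a7b-1c2d3e4f5a6b, 3f2504e0-4f89-11d3-9a0c-0305e82c3301 ok'
    'link 3f2504e0-4f89-11d3-9a0c-0305e82c3301, 0a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d ok'
)

# Filter for matching lines, emit both captured GUIDs, then de-duplicate.
$unique = $lines |
    Where-Object { $_ -match $pair } |
    ForEach-Object { $matches[1]; $matches[2] } |
    Sort-Object -Unique

$unique   # three distinct GUIDs
```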
Mission accomplished! We have extracted each unique GUID from the blob of diagnostic output with a simple one-liner PowerShell script.
In the example above we filter for matches using our composed regular expression and, for each match, we write the contents of each capture group separately to the pipeline:
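To isolate just that part, here is a simplified sketch that swaps the GUID pattern for a pair of plain words (the input and pattern are made up):

```powershell
# One input string produces two output objects, one per capture group.
$result = 'left, right' |
    Where-Object { $_ -match '^(\w+), (\w+)$' } |
    ForEach-Object { $matches[1]; $matches[2] }

$result.Count   # 2
$result         # left, then right
```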
Note the semicolon, which separates two statements inside the script block so that each one writes its own object to the output pipeline.
As an aside, I’ve noticed some people are unaware of this technique to produce multiple output objects in the output pipeline for each item in the input pipeline. It can be useful in many circumstances; for instance, let’s say we want to output the value, square and cube of each input item in a sequence of numbers:
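A sketch of that, using the numbers 1 through 3 as input:

```powershell
# Each input number yields three output objects: value, square and cube.
$result = 1..3 | ForEach-Object { $_; $_ * $_; $_ * $_ * $_ }

$result -join ' '   # 1 1 1 2 4 8 3 9 27
```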
Although that’s kind of a trivial little example, the technique has many useful applications. The more you work with PowerShell and get used to the idea of composing using pipelines of objects (which is what allows us to naturally apply mathematical operations to our input objects in the above example) the happier you will be as you solve those little day-to-day problems with PowerShell.
A final word of caution
If you happen to be new to regular expressions, it is very easy to fall into the trap of trying to use them for everything. While they have splendidly useful applications across many types of text processing problems, they can be a terrible choice in many cases. Use them judiciously.
I urge you to read this very well-thought-out Stack Overflow answer on this particular topic.