Regular Expressions in PowerShell

Use regular expressions when you need more accurate pattern recognition. Regular expressions offer many more wildcard characters than simple wildcards do; for this reason, they can describe patterns in much greater detail. For the very same reason, however, regular expressions are also much more complicated.

Describing Patterns

Using the regular expression elements, you can describe patterns with much greater precision. These elements are grouped into three categories:

  • Char: The Char represents a single character and a collection of Char objects represents a string.
  • Quantifier: Allows you to determine how often a character or a string occurs in a pattern.
  • Anchor: Allows you to determine whether a pattern is a separate word or must be at the beginning or end of a sentence.

The pattern represented by a regular expression may consist of four different character types:

  • Literal characters like “abc”, which exactly match the “abc” string.
  • Masked or “escaped” characters with special meanings in regular expressions; when preceded by “\”, they are understood as literal characters: “\[test\]” looks for the “[test]” string. The following characters have special meanings and for this reason must be masked if used literally: “. ^ $ * + ? { [ ] \ | ( )”.
  • Predefined wildcard characters that represent a particular character category and work like placeholders. For example, “\d” represents any digit from 0 to 9.
  • Custom wildcard characters: They consist of square brackets, within which the characters are specified that the wildcard represents. If you want to use any character except for the specified characters, use “^” as the first character in the square brackets. For example, the placeholder “[^f-h]” stands for all characters except for “f”, “g”, and “h”.

Element     Description
-------     -----------
.           Exactly one character of any kind except for a line break (equivalent to [^\n])
[^abc]      Any character except for those specified in brackets
[^a-z]      Any character except for those in the range specified in the brackets
[abc]       One of the characters specified in brackets
[a-z]       Any character in the range indicated in brackets
\a          Bell alarm (ASCII 7)
\c          Prefix for a control character; must be followed by a letter (see \cA-\cZ)
\cA-\cZ     Control+A to Control+Z, equivalent to ASCII 1 to ASCII 26
\d          A digit (equivalent to [0-9])
\D          Any character except for digits
\e          Escape (ASCII 27)
\f          Form feed (ASCII 12)
\n          New line
\r          Carriage return
\s          Any whitespace character like a blank character, tab, or line break
\S          Any character except for a blank character, tab, or line break
\t          Tab character
\uFFFF      Unicode character with the hexadecimal code FFFF. For example, the Euro symbol has the code 20AC
\v          Vertical tab (ASCII 11)
\w          Letter, digit, or underscore
\W          Any character except for letters, digits, and underscores
\xnn        Particular character, where nn specifies the hexadecimal ASCII code
.*          Any number of any character (including no characters at all)

Table 13.8: Placeholders for characters
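
As a rough illustration of the placeholders in Table 13.8, the following sketch combines predefined and custom character classes. The sample strings and variable names are made up for this example, and single quotation marks are used so that PowerShell leaves the patterns untouched:

```powershell
# Predefined class: \d matches a single digit
$hasDigit   = 'Room 7'  -match '\d'          # True

# Custom class: exactly one of the letters f, g, or h
$inRange    = 'g'       -match '^[f-h]$'     # True

# Negated custom class: any character except f, g, or h
$notInRange = 'g'       -match '^[^f-h]$'    # False

# \w covers letters, digits, and the underscore; \s covers white space
$wordOnly   = 'user_01' -match '^\w+$'       # True
$hasSpace   = 'user 01' -match '\s'          # True

$hasDigit, $inRange, $notInRange, $wordOnly, $hasSpace
```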

Quantifiers

Every wildcard listed in Table 13.8 represents exactly one character. Using quantifiers, you can determine more precisely how many characters are represented. For example, “\d{1,3}” stands for a digit occurring one to three times, that is, a one-to-three digit number.

Element     Description
-------     -----------
*           Preceding expression is not matched or matched once or several times (matches as much as possible)
*?          Preceding expression is not matched or matched once or several times (matches as little as possible)
.*          Any number of any character (including no characters at all)
?           Preceding expression is not matched or matched once (matches as much as possible)
??          Preceding expression is not matched or matched once (matches as little as possible)
{n,}        n or more matches
{n,m}       Between n and m matches, inclusive
{n}         Exactly n matches
+           Preceding expression is matched once or several times (matches as much as possible)

Table 13.9: Quantifiers for patterns
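
The quantifiers in Table 13.9 can be tried out with a few one-liners. This is only a sketch with invented sample strings; the “^” and “$” anchors are added so that the whole string has to fit the pattern:

```powershell
$q1 = '42'    -match '^\d{1,3}$'   # True:  two digits fit {1,3}
$q2 = '12345' -match '^\d{1,3}$'   # False: five digits are too many
$q3 = 'ac'    -match '^ab*c$'      # True:  * also allows zero "b" characters
$q4 = 'ac'    -match '^ab+c$'      # False: + needs at least one "b"
$q5 = 'abbbc' -match '^ab+c$'      # True
$q6 = 'abc'   -match '^ab{2}c$'    # False: {2} needs exactly two "b" characters

$q1, $q2, $q3, $q4, $q5, $q6
```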

Anchors

Anchors determine whether a pattern has to be at the beginning or end of a string or word. For example, the regular expression “\b\d{1,3}” finds numbers of up to three digits only if they begin at a word boundary in a string. The number “123” in the string “Bart123” would not be found.

Element     Description
-------     -----------
$           Matches at end of a string (\Z is less ambiguous for multi-line texts)
\A          Matches at beginning of a string, including multi-line texts
\b          Matches on word boundary (first or last characters in words)
\B          Must not match on word boundary
\Z          Must match at end of string, including multi-line texts
^           Must match at beginning of a string (\A is less ambiguous for multi-line texts)

Table 13.10: Anchor boundaries
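
A short sketch of how the anchors from Table 13.10 change the outcome; the sample strings are illustrative only:

```powershell
# \b: the digits must begin and end at a word boundary
$a1 = 'Bart123'  -match '\b\d{1,3}\b'   # False: no boundary between "t" and "1"
$a2 = 'Bart 123' -match '\b\d{1,3}\b'   # True:  the blank creates a boundary

# ^ and $: the digits must sit at the very beginning or end of the string
$a3 = 'abc123'   -match '^\d+'          # False: the string starts with letters
$a4 = 'abc123'   -match '\d+$'          # True:  the string ends with digits

$a1, $a2, $a3, $a4
```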

Recognizing IP Addresses

Patterns such as an IP address can be described much more precisely by regular expressions than by simple wildcard characters. Usually, you would use a combination of characters and quantifiers to specify which characters may occur in a string and how often:

$ip = "10.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

True
$ip = "a.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

False
$ip = "1000.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

False

The pattern is described here as four numbers (char: \d) of between one and three digits (using the quantifier {1,3}), separated by literal dots (\.) and anchored on word boundaries (using the anchor \b), meaning that the address must be set off from the surrounding text, for example by white space like blank characters, tabs, or line breaks. The check is far from perfect because it does not verify whether the numbers really lie in the permitted range from 0 to 255.

# There still are entries incorrectly identified as valid IP addresses:
$ip = "300.400.500.999"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

True

Validating E-Mail Addresses

If you’d like to verify whether a user has given a valid e-mail address, use the following regular expression:

$email = “[email protected]
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

True
$email = “[email protected]
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

False

Whenever you look for an expression that occurs as a single “word” in text, delimit your regular expression by word boundaries (anchor: \b). The regular expression will then know you’re interested only in those passages that are demarcated from the rest of the text, for example by white space like blank characters, tabs, or line breaks.

The regular expression subsequently specifies which characters may be included in an e-mail address. Permissible characters are in square brackets and consist of “ranges” (for example, “A-Z0-9”) and single characters (such as “._%+-”). The “+” behind the square brackets is a quantifier and means that at least one of the given characters must be present; any number of further characters from the set may follow.

Following this is “@” and, after it, another run of text built from the same kind of characters as those in front of “@”. A dot (.) in the e-mail address follows. This dot is introduced with a “\” character because the dot actually has a different meaning in regular expressions if it isn’t within square brackets. The backslash ensures that the regular expression understands the dot behind it literally.

After the dot is the domain identifier, which may consist solely of letters ([A-Z]). A quantifier ({2,4}) again follows the square brackets. It specifies that the domain identifier may consist of at least two and at most four of the given characters.

However, this regular expression still has one flaw. While it does verify whether a valid e-mail address is in the text somewhere, there could be another text before or after it:

$email = "Email please to [email protected] and reply!"
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

True

Because of “\b”, your regular expression searches for the pattern anywhere in the text and takes only word boundaries into account. If you prefer to check whether the entire text corresponds to an authentic e-mail address, use the anchors for the text beginning (anchor: “^”) and text ending (anchor: “$”) instead of word boundaries:

$email -match "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
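
To make the difference between the two anchoring styles concrete, here is a minimal sketch. The addresses use the reserved example.com domain and, like the variable names, are assumptions made only for this illustration:

```powershell
# Hypothetical pattern variables, named only for this sketch
$wordPattern  = '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
$wholePattern = '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

# Hypothetical sample data
$plain    = 'user@example.com'
$embedded = 'Email please to user@example.com and reply!'

$e1 = $plain    -match $wordPattern    # True
$e2 = $embedded -match $wordPattern    # True:  the address is found inside the text
$e3 = $plain    -match $wholePattern   # True
$e4 = $embedded -match $wholePattern   # False: the text is more than just an address

$e1, $e2, $e3, $e4
```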

Simultaneous Searches for Different Terms

Sometimes, search terms are ambiguous because there may be several ways to write them. You can use the “?” quantifier to mark parts of the search term as optional. In simple cases, put a “?” after an optional character. Then the character in front of “?” may, but doesn’t have to, turn up in the search term:

"color" -match "colou?r"
True
"colour" -match "colou?r"
True

The “?” character here doesn’t stand for one arbitrary character, as you might expect after using simple wildcards. For regular expressions, “?” is a quantifier and always specifies how often the character or expression in front of it may occur. In the example, therefore, “u?” means that the letter “u” may, but need not, be at the specified location in the pattern. Other quantifiers are “*” (the preceding element may occur any number of times, including not at all) and “+” (the preceding element must occur at least once).

If you prefer to mark more than one character as optional, put the characters in a sub-expression, which is placed in parentheses. The following example recognizes both the month designator “Nov” and “November”:

"Nov" -match "\bNov(ember)?\b"

True

"November" -match "\bNov(ember)?\b"

True

If you’d rather use several alternative search terms, use the OR character “|”:

"Bob and Ted" -match "Alice|Bob"

True

And if you want to mix alternative search terms with fixed text, use sub-expressions again:

# finds "and Bob":
"Peter and Bob" -match "and (Bob|Willy)"

True

# does not find "and Bob":
"Bob and Peter" -match "and (Bob|Willy)"

False

Case Sensitivity

In keeping with customary PowerShell practice, the -match operator is case insensitive. Use the -cmatch operator as an alternative if you’d prefer case sensitivity:

# -match is case insensitive:
"hello" -match "heLLO"

True
# -cmatch is case sensitive:
"hello" -cmatch "heLLO"

False

If you want case sensitivity in only some pattern segments, use -match and specify in your regular expression which text segments are case sensitive and which are insensitive. Anything following the “(?i)” construct is case insensitive. Conversely, anything following “(?-i)” is case sensitive. This explains why the word “test” in the example below is recognized only if its last two characters are lowercase, while case has no importance for the first two characters:

"TEst" -match "(?i)te(?-i)st"

True
"TEST" -match "(?i)te(?-i)st"

False

If you use a .NET Framework RegEx object instead of -match, the RegEx object is case sensitive by default and behaves like -cmatch. If you prefer case insensitivity, either use the above construct to specify an option in your regular expression or pass “IgnoreCase” to tell the RegEx object your preference:

[regex]::Matches("test", "TEST", "IgnoreCase")
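
The three ways of handling case with the RegEx class can be compared side by side. This is a small sketch using the static IsMatch method; the variable names are chosen only for this example:

```powershell
# Default behavior of the RegEx class: case sensitive
$default = [regex]::IsMatch('test', 'TEST')

# Inline option (?i) switches the pattern to case insensitivity
$inline  = [regex]::IsMatch('test', '(?i)TEST')

# The same preference expressed through the RegexOptions enumeration
$ignore  = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
$option  = [regex]::IsMatch('test', 'TEST', $ignore)

$default, $inline, $option
```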

Element     Description                                                                                                          Category
-------     -----------                                                                                                          --------
(xyz)       Sub-expression
|           Alternation construct                                                                                                Selection
\           When followed by a character, the character is not recognized as a formatting character but as a literal character   Escape
x?          Changes the x quantifier into a “lazy” quantifier                                                                    Option
(?xyz)      Activates or deactivates special modes, among others, case sensitivity                                               Option
x+          Turns the x quantifier into a “greedy” quantifier                                                                    Option
?:          Groups a sub-expression without storing its result                                                                   Reference
?<name>     Specifies name for back references                                                                                   Reference

Table 13.11: Regular expression elements

Of course, a regular expression can perform any number of detailed checks, such as verifying whether numbers in an IP address lie within the permissible range from 0 to 255. The problem is that this makes regular expressions long and hard to understand. Fortunately, you generally won’t need to invest much time in learning complex regular expressions like the ones coming up. It’s enough to know which regular expression to use for a particular pattern. Regular expressions for nearly all standard patterns can be downloaded from the Internet. In the following example, we’ll look more closely at a complex regular expression that evidently is entirely made up of the conventional elements listed in Table 13.11:

$ip = "300.400.500.999"
$ip -match "\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)" +
"{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"

False

The expression validates only expressions running into word boundaries (the anchor is \b). The following sub-expression defines every single number:

(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)

The construct ?: is optional and enhances speed. After it come three alternatively permitted number formats separated by the alternation construct "|". 25[0-5] is a number from 250 through 255. 2[0-4][0-9] is a number from 200 through 249. Finally, [01]?[0-9][0-9]? is a number from 0-9 or 00-99 or 100-199. The quantifier "?" means that the preceding element is optional. The result is that the sub-expression describes numbers from 0 through 255. An IP address consists of four such numbers. A dot always follows the first three numbers. For this reason, the following expression combines the definition of the number with a dot:

(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}

A literal dot (\.) is appended to the number. This construct must be present three times ({3}). When the fourth number is also appended, the regular expression is complete. You have learned to create sub-expressions (by using parentheses) and how to iterate sub-expressions (by indicating the number of iterations in braces after the sub-expression), so you should now be able to shorten the IP address regular expression used at the beginning:

$ip = "10.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

True

$ip -match "\b(?:\d{1,3}\.){3}\d{1,3}\b"

True
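
The range-checking expression can be wrapped in a small helper so that it does not have to be retyped. The following is only a sketch: Test-IPv4Address is a hypothetical function name, and it anchors the pattern with “^” and “$” (an assumption of this example) so that the whole string must be an address:

```powershell
# Test-IPv4Address is a hypothetical helper name chosen for this sketch
function Test-IPv4Address {
    param([string]$Address)

    # One number from 0 through 255, as explained above
    $octet = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

    # Three numbers followed by a dot, then the fourth number
    $pattern = '^(?:' + $octet + '\.){3}' + $octet + '$'

    $Address -match $pattern
}

Test-IPv4Address '10.10.10.10'       # True
Test-IPv4Address '255.255.255.255'   # True
Test-IPv4Address '300.400.500.999'   # False
Test-IPv4Address '256.1.1.1'         # False
```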

Finding Information in Text

Regular expressions can recognize patterns. They can also filter out data corresponding to certain patterns from text. As such, regular expressions are excellent tools for parsing raw data. For example, use the same regular expression as the one above to identify e-mail addresses if you want to extract an e-mail address from a letter. Afterwards, look in the $matches variable to see which results were returned. The $matches variable is created automatically when you use the -match operator (or one of its siblings, like -cmatch).

$matches is a hash table (Chapter 4), so you can either output the entire hash table or access single elements in it by using their names, which you must specify in square brackets:

$rawtext = "If it interests you, my e-mail address is tobias[email protected]"

# Simple pattern recognition:
$rawtext -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

True
# Reading data matching the pattern from raw text:
$matches

Name                           Value
----                           -----
0                              [email protected]

$matches[0]

[email protected]

Does that also work for more than one e-mail address in a text? Unfortunately, not right away. The -match operator looks only for the first matching expression. So, if you want to find more than one occurrence of a pattern in raw text, you have to switch over to the RegEx object underlying the -match operator and use it directly.

In one essential respect, the RegEx object behaves unlike the -match operator. Case sensitivity is the default for the RegEx object, but not for -match. For this reason, you must put the "(?i)" option in front of the regular expression to eliminate confusion, making sure the expression is evaluated without taking case sensitivity into account.

# A raw text contains several e-mail addresses. -match finds the first one only:
$rawtext = "[email protected] sent an e-mail that was forwarded to [email protected]"
$rawtext -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

True
$matches

Name                           Value
----                           -----
0                              [email protected]

# A RegEx object can find any pattern but is case sensitive by default:
$regex = [regex]"(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
$regex.Matches($rawtext)

Groups   : {[email protected]}
Success  : True
Captures : {[email protected]}
Index    : 4
Length   : 13
Value    : [email protected]

Groups   : {[email protected]}
Success  : True
Captures : {[email protected]}
Index    : 42
Length   : 13
Value    : [email protected]

# Limit result to e-mail addresses:
$regex.Matches($rawtext) | Select-Object -Property Value

Value
-----
[email protected]
[email protected]

# Continue processing e-mail addresses:
$regex.Matches($rawtext) | ForEach-Object { "found: $($_.Value)" }

found: [email protected]
found: [email protected]
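
The same technique can be condensed into a few lines that collect every address as a plain string. The sample text below is invented for this sketch and uses the reserved example.com and example.org domains:

```powershell
# Hypothetical sample text with two addresses
$sample = 'alice@example.com forwarded the note to bob@example.org.'

# (?i) makes the RegEx object case insensitive, like -match
$emailRegex = [regex]'(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

# Matches() returns every occurrence; keep only the matched text
$found = $emailRegex.Matches($sample) | ForEach-Object { $_.Value }

$found
```

Note that the trailing period of the sentence is not part of the second result, because the pattern has to end with two to four letters at a word boundary.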

Searching for Several Keywords

You can use the alternation construct "|" to search for a group of keywords, and then find out which keyword was actually found in the string:

"Set a=1" -match "Get|GetValue|Set|SetValue"

True

$matches

Name                           Value
----                           -----
0                              Set

$matches tells you which keyword actually occurs in the string. But note the order of keywords in your regular expression—it's crucial because the first matching keyword is the one selected. In this example, the result would be incorrect:

"SetValue a=1" -match "Get|GetValue|Set|SetValue"

True

$matches[0]

Set

Either change the order of keywords so that longer keywords are checked before shorter ones …:

"SetValue a=1" -match "GetValue|Get|SetValue|Set"

True

$matches[0]

SetValue

... or make sure that your regular expression is precisely formulated, and remember that you're actually searching for single words. Insert word boundaries into your regular expression so that sequential order no longer plays a role:

"SetValue a=1" -match "\b(Get|GetValue|Set|SetValue)\b"

True

$matches[0]

SetValue

It's true here, too, that -match finds only the first match. If your raw text has several occurrences of the keyword, use a RegEx object again:

$regex = [regex]"\b(Get|GetValue|Set|SetValue)\b"
$regex.Matches("Set a=1; GetValue a; SetValue b=12")

Groups   : {Set, Set}
Success  : True
Captures : {Set}
Index    : 0
Length   : 3
Value    : Set

Groups   : {GetValue, GetValue}
Success  : True
Captures : {GetValue}
Index    : 9
Length   : 8
Value    : GetValue

Groups   : {SetValue, SetValue}
Success  : True
Captures : {SetValue}
Index    : 21
Length   : 8
Value    : SetValue

Forming Groups

A raw text line is often a heaping trove of useful data. You can use parentheses to collect this data in sub-expressions so that it can be evaluated separately later. The basic principle is that all the data that you want to find in a pattern should be wrapped in parentheses because $matches will return the results of these sub-expressions as independent elements. For example, if a text line contains a date first, then text, and if both are separated by tabs, you could describe the pattern like this:

# Defining pattern: two text segments separated by a tab
$pattern = "(.*)\t(.*)"

# Generate example line with tab character
$line = "12/01/2009`tDescription"

# Use regular expression to parse line:
$line -match $pattern

True
# Show result:
$matches

Name                           Value
----                           -----
2                              Description
1                              12/01/2009
0                              12/01/2009    Description

$matches[1]

12/01/2009
$matches[2]

Description

When you use sub-expressions, $matches will contain the entire searched pattern in the first element named “0”. Sub-expressions defined in parentheses follow in additional elements. To make them easier to read and understand, you can assign sub-expressions their own names and later use the names to call results. To assign a name to a sub-expression, type ?<Name> as the first statement inside the parentheses:

# Assign subexpressions their own names:
$pattern = "(?<Date>.*)\t(?<Text>.*)"

# Generate example line with tab character:
$line = "12/01/2009`tDescription"

# Use a regular expression to parse line:
$line -match $pattern

True
# Show result:
$matches

Name                    Value
----                    -----
Text                    Description
Date                    12/01/2009
0                       12/01/2009    Description

$matches.Date

12/01/2009
$matches.Text

Description

Each result retrieved by $matches for each sub-expression naturally requires storage space. If you don't need the results, discard them to increase the speed of your regular expression. To do so, type "?:" as the first statement in your sub-expression:

# Don't return a result for the second subexpression:
$pattern = "(?<Date>.*)\t(?:.*)"

# Generate example line with tab character:
$line = "12/01/2009`tDescription"

# Use a regular expression to parse line:
$line -match $pattern

True
# No more results will be returned for the second subexpression:
$matches

Name                   Value
----                   -----
Date                   12/01/2009
0                      12/01/2009    Description
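
Named sub-expressions become especially handy when several lines have to be parsed. The following sketch turns each line into an object; the second sample line and the variable names are invented for this illustration:

```powershell
# Two tab-separated sample lines (the second one is hypothetical)
$lines = "12/01/2009`tDescription", "12/02/2009`tSecond entry"

$records = foreach ($line in $lines) {
    if ($line -match '(?<Date>.*)\t(?<Text>.*)') {
        # Copy the named results out of $matches before the next -match overwrites them
        [pscustomobject]@{
            Date = $matches.Date
            Text = $matches.Text
        }
    }
}

$records
```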

Further Use of Sub-Expressions

With the help of results from each sub-expression, you can create surprisingly flexible regular expressions. For example, how could you define a Web site HTML tag as a pattern? A tag always has the same structure: an opening tag such as <body>, the contents, and a closing tag such as </body>. This means that a pattern for one particular strictly predefined HTML tag can be found quickly:

"<body>contents</body>" -match "<body[^>]*>(.*?)</body>"

True

$matches[1]

contents

The pattern begins with the fixed text “<body”. Any characters other than “>” may follow ([^>]*), then the “>” that closes the opening tag, and finally the contents of the body tag, which may consist of any number of any characters (.*?). The expression, enclosed in parentheses, is a sub-expression and will be returned later as a result in $matches so that you’ll know what is inside the body tag. The concluding part of the tag follows in the form of fixed text (“</body>”).

This regular expression works fine for body tags, but not for other tags. Does this mean that a regular expression has to be defined for every HTML tag? Naturally not. There’s a simpler solution. The problem is that the name of the tag in the regular expression occurs twice, once initially (“<body”) and once terminally (“</body>”). If the regular expression is supposed to be able to process any tags, then it would have to be able to find out the name of the tag automatically and use it in both locations. How to accomplish that? Like this:

"<body>Contents</body>" -match "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"

True

$matches

Name                           Value
----                           -----
2                              Contents
1                              body
0                              <body>Contents</body>

This regular expression no longer contains a strictly predefined tag name and works for any tags matching the pattern. How does that work? The initial tag in parentheses is defined as a sub-expression, more specifically as a word that begins with a letter and that can consist of any additional alphanumeric characters.

([A-Z][A-Z0-9]*)

The name of the tag found here must subsequently be repeated in the terminal part. There you’ll find “</\1>”. “\1” refers to the result of the first sub-expression. The first sub-expression evaluated the tag name and so this name is used automatically for the terminal part.

The following RegEx object could directly return the contents of any HTML tag:

$regexTag = [regex]"(?i)<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"
$result = $regexTag.Matches("<button>Press here</button>")
$result[0].Groups[2].Value + " is in tag " + $result[0].Groups[1].Value

Press here is in tag button
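
To see what the back reference “\1” contributes, the following sketch runs the same kind of pattern over a string with several tags, one of which is closed with the wrong name. The HTML fragment and the variable names are assumptions made for this example:

```powershell
$tagRegex = [regex]'(?i)<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>'

# Hypothetical fragment: the last tag opens as <b> but closes as </i>
$html = '<b>bold</b> and <i>italic</i>, but <b>broken</i>'

# Report "tag=contents" for every properly closed tag
$pairs = $tagRegex.Matches($html) | ForEach-Object {
    '{0}={1}' -f $_.Groups[1].Value, $_.Groups[2].Value
}

$pairs
```

Only the first two tags are reported; the mismatched pair is skipped because the closing name has to repeat whatever the first sub-expression captured.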

Greedy or Lazy? Detailed or Concise Results…

Readers who have paid careful attention may wonder why the contents of the HTML tag were defined by “.*?” and not simply by “.*”. After all, “.*” should suffice so that an arbitrary character (char: “.”) can turn up any number of times (quantifier: “*”). At first glance, the difference between “.*” and “.*?” is not easy to recognize, but a short example should make it clear.

Assume that you would like to evaluate month specifications in a logging file, but the months are not all specified in the same way. Sometimes you use the short form, other times the long form of the month name is used. As you’ve seen, that’s no problem for regular expressions, because sub-expressions allow parts of a keyword to be declared optional:

"Feb" -match "Feb(ruary)?"

True
$matches[0]

Feb
"February" -match "Feb(ruary)?"

True
$matches[0]

February

In both cases, the regular expression recognizes the month, but returns different results in $matches. By default, the regular expression is “greedy” and wants to achieve a match in as much detail as possible. If the text is “February,” then the expression will search for a match starting with “Feb” and then continue searching “greedily” to check whether even more characters match the pattern. If they do, the entire (detailed) text is reported back.

However, if your main concern is just standardizing the names of months, you would probably prefer getting back the shortest common text. That’s exactly what the “??” quantifier does, which in contrast to the regular expression is “lazy.” As soon as it recognizes a pattern, it returns it without checking whether additional characters might match the pattern optionally.

"Feb" -match "Feb(ruary)??"

True
$matches[0]

Feb
"February" -match "Feb(ruary)??"

True
$matches[0]

Feb

Just what is the connection between the “??” quantifier of this example and the “*?” of the preceding example? In reality, “*?” is not a self-contained quantifier. The appended “?” just turns a normally “greedy” quantifier into a “lazy” one. This means you can use “?” to force the quantifier “*” to be “lazy” and to return the shortest possible result. That’s exactly what happened with our regular expressions for HTML tags. You can see how important this is if you use the greedy quantifier “*” instead of “*?”: it will then attempt to retrieve a result in as much detail as possible. That can go wrong:

# The greedy quantifier * returns results in as much detail as possible:
"<body>Contents</body></body>" -match "<body[^>]*>(.*)</body>"

True
$matches[1]

Contents</body>
# The right quantifier is *?, the lazy one, which returns results that
# are as short as possible
"<body>Contents</body></body>" -match "<body[^>]*>(.*?)</body>"

True
$matches[1]

Contents

According to the definition of the regular expression, any characters are allowed inside the tag. Moreover, the entire expression must end with “</body>”. If “</body>” is also inside the tag, the following will happen: the greedy quantifier (“*”), coming across the first “</body>”, will at first assume that the pattern is already completely matched. But because it is greedy, it will continue to look and will discover the second “</body>” that also fits the pattern. The result is that it will take both “</body>” specifications into account, allocate one to the contents of the tag, and use the other as the conclusion of the tag.

In this example, it would be better to use the lazy quantifier (“*?”), which notices when it encounters the first “</body>” that the pattern is already correctly matched and consequently doesn’t go to the trouble of continuing to search. It will ignore the second “</body>” and use the first to conclude the tag.
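
The same effect shows up outside of HTML. As a small sketch with an invented sample sentence, compare what a greedy and a lazy quantifier capture between quotation marks:

```powershell
$text = 'say "one" and "two"'

# Greedy: .* runs to the last quotation mark in the text
$null = $text -match '"(.*)"'
$greedy = $matches[1]

# Lazy: .*? stops at the first quotation mark that completes the pattern
$null = $text -match '"(.*?)"'
$lazy = $matches[1]

"greedy: $greedy"
"lazy:   $lazy"
```

The greedy version returns one" and "two, while the lazy version returns only one.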

Finding String Segments

Entire books have been written about the uses of regular expressions. That’s why it would go beyond the scope of this book to discuss more details. However, our last example, which locates text segments, shows how you can use the elements listed in Table 13.11 to easily harvest surprising search results. If you type two words, the regular expression will retrieve the text segment between the two words if at least one word is, and not more than six other words are, between the two words:

"Find word segments from start to end" -match "\bstart\W+(?:\w+\W+){1,6}?end\b"

True
$matches

Name                           Value
----                           -----
0                              start to end

Replacing a String

You already know how to replace a string because you were already introduced to the -replace operator. Simply tell the operator what term you want to replace in a string and the task is done:

"Hello, Ralph" -replace "Ralph", "Martina"

Hello, Martina

But simple replacement isn’t always sufficient, so you need to use regular expressions for replacements. Some of the following interesting examples show how that could be useful.

Perhaps you’d like to replace several different terms in a string with one other term. Without regular expressions, you’d have to replace each term separately. With regular expressions, you can use the alternation operator, “|”, instead. Note that the dot is masked with a backslash so that it is understood literally:

"Mr. Miller and Mrs. Meyer" -replace "(Mr\.|Mrs\.)", "Our client"

Our client Miller and Our client Meyer

You can type any term in parentheses and use the “|” symbol to separate them. All the terms will be replaced with the replacement string you specify.

Using Back References

This last example replaces specified keywords anywhere in a string. Often, that’s sufficient, but sometimes you don’t want to replace a keyword everywhere it occurs but only when it occurs in a certain context. In such cases, the context must be defined in some way in the pattern. How could you change the regular expression so that it replaces only the names Miller and Meyer? Like this:

"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace
"(Mr\.|Mrs\.)\s*(Miller|Meyer)", "Our client"

Our client, Our client and Mr. Werner

The result looks a little peculiar, but the pattern you're looking for was correctly identified. The only replacements were Mr. or Mrs. Miller and Mr. or Mrs. Meyer. The term "Mr. Werner" wasn't replaced. Unfortunately, the result also shows that it doesn't make any sense here to replace the entire pattern. At least the name of the person should be retained. Is that possible?

This is where the back referencing you've already seen comes into play. Whenever you use parentheses in your regular expression, the result inside the parentheses is evaluated separately, and you can use these separate results in your replacement string. The first sub-expression always reports whether a "Mr." or a "Mrs." was found in the string. The second sub-expression returns the name of the person. The terms "$1" and "$2" provide you the sub-expressions in the replacement string (the number is consequently a sequential number; you could also use "$3" and so on for additional sub-expressions).

"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace
"(Mr\.|Mrs\.)\s*(Miller|Meyer)", "Our client $2"

Our client , Our client  and Mr. Werner

Strangely enough, at first the back references don’t seem to work. The cause can be found quickly: “$1” and “$2” look like PowerShell variables, but in reality they are placeholders of the -replace operator. As a result, if you put the replacement string inside double quotation marks, PowerShell will replace “$2” with the PowerShell variable $2, which is normally empty. For replacement with back references to work, you must therefore either put the replacement string inside single quotation marks or add a backtick in front of the “$” special character so that PowerShell won’t recognize it as its own variable and replace it:

# Replacement text must be inside single quotation marks
# so that PowerShell does not expand $2 as a variable:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace
"(Mr\.|Mrs\.)\s*(Miller|Meyer)", 'Our client $2'

Our client Miller, Our client Meyer and Mr. Werner
# Alternatively, $ can also be masked by `$:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace
"(Mr\.|Mrs\.)\s*(Miller|Meyer)", "Our client `$2"

Our client Miller, Our client Meyer and Mr. Werner
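
Back references can also reorder the parts of a match. The following sketch rearranges a date; the date format and variable names are assumptions for this example. It also shows the “${name}” form, which refers to a named sub-expression in the replacement string:

```powershell
$date = '12/01/2009'

# Numbered back references: $1 = month, $2 = day, $3 = year
$iso = $date -replace '(\d{2})/(\d{2})/(\d{4})', '$3-$1-$2'

# Named back references: ${y}, ${m}, and ${d} refer to the named groups
$named = $date -replace '(?<m>\d{2})/(?<d>\d{2})/(?<y>\d{4})', '${y}-${m}-${d}'

$iso      # 2009-12-01
$named    # 2009-12-01
```

Both replacement strings are in single quotation marks, so PowerShell passes them to -replace unchanged.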

Putting Characters First at Line Beginnings

Replacements can also be made in multiple places in a text of several lines. For example, when you respond to an e-mail, the text of the old e-mail is usually quoted in your new e-mail and marked with “>” at the beginning of each line. Regular expressions can do the marking.

However, to accomplish this, you need to know a little more about “multi-line” mode. Normally, this mode is turned off, and the “^” anchor represents the text beginning and the “$” the text ending. So that these two anchors refer respectively to the line beginning and line ending of a text of several lines, the multi-line mode must be turned on with the “(?m)” statement. Only then will -replace substitute the pattern in every single line. Once the multi-line mode is turned on, the anchors “^” and “\A”, as well as “$” and “\Z”, will suddenly behave differently. “\A” will continue to indicate the text beginning, while “^” will mark the line beginning; “\Z” will indicate the text ending, while “$” will mark the line ending.

# Using Here-String to create a text of several lines:
$text = @"
Here is a little text.
I want to attach this text to an e-mail as a quote.
That’s why I would put a “>” before every line.
"@
$text

Here is a little text.
I want to attach this text to an e-mail as a quote.
That’s why I would put a “>” before every line.

# Normally, -replace doesn’t work in multiline mode.
# For this reason, only the first line is replaced:
$text -replace "^", "> "

> Here is a little text.
I want to attach this text to an e-mail as a quote.
That’s why I would put a “>” before every line.

# If you turn on multiline mode, replacement will work in every line:
$text -replace "(?m)^", "> "

> Here is a little text.
> I want to attach this text to an e-mail as a quote.
> That’s why I would put a “>” before every line.

# The same can also be accomplished by using a RegEx object,
# where the multiline option must be specified:
[regex]::Replace($text, "^", "> ",
[Text.RegularExpressions.RegexOptions]::Multiline)

> Here is a little text.
> I want to attach this text to an e-mail as a quote.
> That's why I would put a ">" before every line.

# In multiline mode, \A stands for the text beginning
#  and ^ for the line beginning:
[regex]::Replace($text, "\A", "> ",
[Text.RegularExpressions.RegexOptions]::Multiline)

> Here is a little text.
I want to attach this text to an e-mail as a quote.
That’s why I would put a “>” before every line.
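
The effect of the “(?m)” statement can be checked on a short string. This is a minimal sketch; the three-line sample text is made up and uses `n as the line separator:

```powershell
# Three lines separated by line feeds
$text = "first`nsecond`nthird"

# Without multi-line mode, ^ matches only at the beginning of the text
$onlyFirst = $text -replace '^', '> '

# With (?m), ^ matches at the beginning of every line
$quoted = $text -replace '(?m)^', '> '

$onlyFirst
$quoted
```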

Removing Superfluous White Space

Regular expressions can perform routine tasks as well, such as removing superfluous white space. The pattern describes a white space character (char: “\s”) that occurs at least twice (quantifier: “{2,}”). That is replaced with a single blank character.

"Too   many   blank   characters" -replace "\s{2,}", " "

Too many blank characters

Finding and Removing Doubled Words

How is it possible to find and remove doubled words in text? Here, you can use back referencing again. The pattern could be described as follows:

"\b(\w+)(\s+\1){1,}\b"

The pattern searched for starts at a word boundary (anchor: “\b”). It consists of one word (the character “\w” and quantifier “+”). White space follows (the character “\s” and quantifier “+”), and then the same word again (back reference “\1”). This group, the white space and the repeated word, must occur at least once (quantifier “{1,}”). The entire pattern is then replaced with the first back reference, that is, the first located word.

# Find and remove doubled words in a text:
"This this this is a test" -replace "\b(\w+)(\s+\1){1,}\b", '$1'

This is a test
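
Before removing doubled words, it can be useful to list them. The following sketch does both; the sample sentence and variable names are invented, and “(?i)” is added because the RegEx class is case sensitive by default:

```powershell
$text = 'This this is is a test test test'

# List each word that is repeated directly after itself
$doubled = [regex]::Matches($text, '(?i)\b(\w+)(\s+\1)+\b') |
    ForEach-Object { $_.Groups[1].Value }

# Remove the repetitions; -replace is case insensitive on its own
$clean = $text -replace '\b(\w+)(\s+\1)+\b', '$1'

$doubled
$clean
```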

Regular Expressions in Powershell by Chris Roualin