Patterns

Patterns consist of literal strings and character classes. Patterns may contain sub-patterns, which are patterns enclosed in parentheses.

Escaping characters

In patterns as well as in character classes, some characters have a special meaning. To match any of those characters literally, you must mark or escape them, so that the regular expression software knows to interpret them in their literal meaning.

This is done by prepending the character with a backslash (\).

The regular expression software will silently ignore escaping a character that does not have any special meaning in the context, so escaping for example a j (\j) is safe. If you are in doubt whether a character could have a special meaning, you can therefore escape it safely.

Escaping of course applies to the backslash character itself: to match a literal backslash, you would write \\.
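As an illustration only, here is a minimal sketch using Python's re module. This is an assumption made for the example: Python's dialect is similar to, but not identical with, the one described here. In particular, Python rejects escaped letters that have no special meaning (such as \j) instead of silently ignoring them.

```python
import re

# An unescaped dot is special (it matches any character),
# so it has to be escaped to match a literal dot.
unescaped = re.compile(r"3.14")
escaped = re.compile(r"3\.14")

print(bool(unescaped.fullmatch("3x14")))  # the dot is allowed to match "x"
print(bool(escaped.fullmatch("3x14")))    # the escaped dot is not
print(bool(escaped.fullmatch("3.14")))

# A literal backslash is written as two backslashes in the pattern.
backslash = re.compile(r"C:\\temp")
print(bool(backslash.fullmatch("C:\\temp")))
```

The patterns are written as raw strings (r"...") so that Python itself does not interpret the backslashes before the regular expression software sees them.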

Character Classes and abbreviations

A character class is an expression that matches one of a defined set of characters. In Regular Expressions, character classes are defined by putting the legal characters for the class in square brackets, [], or by using one of the abbreviated classes described below.

Simple character classes just contain one or more literal characters, for example [abc] (matching any one of the letters a, b or c) or [0123456789] (matching any digit).

Because letters and digits have a logical order, you can abbreviate those by specifying ranges of them: [a-c] is equal to [abc] and [0-9] is equal to [0123456789]. Combining these constructs is completely legal: for example, [a-fynot1-38] would match any one of a, b, c, d, e, f, y, n, o, t, 1, 2, 3 or 8.

As capital letters are different characters from their non-capital equivalents, to create a caseless character class matching a or b, in any case, you need to write it [aAbB].

It is of course possible to create a negative class, matching anything but the characters listed. To do so, put a caret (^) at the beginning of the class:

[^abc] will match any character but a, b or c.
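The classes above can be tried out with a short sketch. It uses Python's re module as a stand-in (an assumption for illustration; simple classes, ranges and negation behave the same way there), and the sample strings are made up.

```python
import re

# Ranges and single characters combined in one class.
mixed = re.compile(r"[a-fynot1-38]")
mixed_found = mixed.findall("a g y 2 4 8 z")

# Upper and lower case are different characters, so list both.
caseless = re.compile(r"[aAbB]")
caseless_found = caseless.findall("aXBc")

# A leading caret negates the class.
negated = re.compile(r"[^abc]")
negated_found = negated.findall("abcd!")

print(mixed_found)     # → ['a', 'y', '2', '8']
print(caseless_found)  # → ['a', 'B']
print(negated_found)   # → ['d', '!']
```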

In addition to literal characters, some abbreviations are defined, making life still a bit easier:

\a

This matches the ASCII bell character (BEL, 0x07).

\f

This matches the ASCII form feed character (FF, 0x0C).

\n

This matches the ASCII line feed character (LF, 0x0A, Unix newline).

\r

This matches the ASCII carriage return character (CR, 0x0D).

\t

This matches the ASCII horizontal tab character (HT, 0x09).

\v

This matches the ASCII vertical tab character (VT, 0x0B).

\xhhhh

This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).

\0ooo (i.e., \zero ooo)

This matches the ASCII/Latin-1 character corresponding to the octal number ooo (between 0 and 0377).

. (dot)

This matches any character (including newline).

\d

This matches a digit. Equal to [0-9]

\D

This matches a non-digit. Equal to [^0-9] or [^\d]

\s

This matches a whitespace character. Practically equal to [ \t\n\r]

\S

This matches a non-whitespace. Practically equal to [^ \t\r\n], and equal to [^\s]

\w

Matches any word character - in this case any letter or digit. Note that, unlike in Perl regular expressions, underscore (_) is not matched. Equal to [a-zA-Z0-9]

\W

Matches any non-word character - anything but letters or numbers. Equal to [^a-zA-Z0-9] or [^\w]

The abbreviated classes can be put inside a custom class. For example, to match a word character, a blank or a dot, you could write [\w \.]
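A hedged sketch of the abbreviated classes, again using Python's re module purely for illustration. Python differs from the description above in ways the comments point out: its \w includes the underscore, its dot does not match a newline unless the DOTALL flag is given, and the ASCII flag is needed to restrict \d and \w to ASCII characters.

```python
import re

# \d with the ASCII flag behaves like [0-9].
digits = re.findall(r"\d", "a1 b22", flags=re.ASCII)

# \S+ collects runs of non-whitespace characters.
chunks = re.findall(r"\S+", "foo  bar\tbaz")

# The \w described in the text excludes underscore, so it is spelled out here.
word_like = re.findall(r"[a-zA-Z0-9]+", "snake_case x9")
# Python's own \w includes the underscore.
python_w = re.findall(r"\w+", "snake_case x9", flags=re.ASCII)

# An abbreviated class inside a custom class: word character, blank or dot.
custom = re.fullmatch(r"[\w \.]+", "file name.txt")

# In Python the dot only matches a newline with the DOTALL flag.
dot_default = re.fullmatch(r"a.b", "a\nb")
dot_all = re.fullmatch(r"a.b", "a\nb", flags=re.DOTALL)

print(digits)     # → ['1', '2', '2']
print(chunks)     # → ['foo', 'bar', 'baz']
print(word_like)  # → ['snake', 'case', 'x9']
print(python_w)   # → ['snake_case', 'x9']
```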

Note

The POSIX notation of classes, [:<class name>:] is currently not supported.

Characters with special meanings inside character classes

The following characters have a special meaning inside the [] character class construct, and must be escaped to be literally included in a class:

]

Ends the character class. Must be escaped unless it is the very first character in the class (where it may follow an unescaped caret).

^ (caret)

Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.

- (dash)

Denotes a logical range. Must always be escaped within a character class.

\ (backslash)

The escape character. Must always be escaped.
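The following sketch shows these four characters included literally in a class. It uses Python's re module as an assumed stand-in; the escaping rules inside classes are the same there for these cases.

```python
import re

# All four special characters, each escaped, in a single class.
specials = re.compile(r"[\]\^\-\\]")
specials_found = specials.findall(r"a]b^c-d\e")

# A closing bracket may stay unescaped when it is the very first character.
leading_bracket = re.compile(r"[]a]")
bracket_found = leading_bracket.findall("x]ya")

print(specials_found)  # → [']', '^', '-', '\\']
print(bracket_found)   # → [']', 'a']
```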

Alternatives: matching one of

If you want to match one of a set of alternative patterns, you can separate those with | (vertical bar character).

For example, to find either John or Harry you would use the expression John|Harry.

Sub Patterns

Sub patterns are patterns enclosed in parentheses, and they have several uses in the world of regular expressions.

Specifying alternatives

You may use a sub pattern to group a set of alternatives within a larger pattern. The alternatives are separated by the character | (vertical bar).

For example, to match any of the words int, float or double, you could use the pattern int|float|double. If you only want to find one of them when it is followed by some whitespace and then some letters, put the alternatives inside a subpattern: (int|float|double)\s+\w+.
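A small sketch of both forms, using Python's re module for illustration (an assumption; the sample strings are invented).

```python
import re

# Plain alternatives: either name matches.
names = re.findall(r"John|Harry", "Harry met John and Sally")

# Alternatives grouped in a sub pattern, followed by more pattern.
decl = re.compile(r"(int|float|double)\s+\w+")
hit = decl.search("static double ratio = 0.5;")
miss = decl.search("return double;")

print(names)         # → ['Harry', 'John']
print(hit.group(0))  # → double ratio
print(hit.group(1))  # → double
print(miss)          # → None
```

Without the parentheses, int|float|double\s+\w+ would apply the trailing \s+\w+ to the last alternative only.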

Capturing matching text (back references)

If you want to use a back reference, use a sub pattern to have the desired part of the pattern remembered.

For example, if you want to find two occurrences of the same word separated by a comma and possibly some whitespace, you could write (\w+),\s*\1. The sub pattern \w+ would find a chunk of word characters, and the entire expression would match if those were followed by a comma, 0 or more whitespace and then an equal chunk of word characters. (The string \1 references the first sub pattern enclosed in parentheses)
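The repeated-word expression can be sketched as follows, with Python's re module assumed as a stand-in (\1 has the same meaning there).

```python
import re

# \1 must match the same text that the first sub pattern captured.
repeated = re.compile(r"(\w+),\s*\1")

hit = repeated.search("yes, yes, no")
miss = repeated.search("yes, no")

print(hit.group(0))  # → yes, yes
print(hit.group(1))  # → yes
print(miss)          # → None
```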

Note

A back reference such as \1 followed by further digits is ambiguous: \12 could mean the 12th subpattern, or the first subpattern followed by a literal 2. To avoid this, multi-digit subpatterns are referenced with the syntax \{12}.

Examples:

  • \{12}1 refers to subpattern 12, followed by a literal 1

  • \123 refers to subpattern 1, followed by 23 as literal text

Lookahead Assertions

A lookahead assertion is a sub pattern starting with either ?= (positive lookahead) or ?! (negative lookahead).

For example, to match the literal string Bill but only if not followed by Gates, you could use this expression: Bill(?! Gates). (This would find the Bill in Bill Clinton as well as in Billy the kid, but silently skip any Bill followed by Gates.)

Sub patterns used for assertions are not captured.
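A minimal sketch of the Bill example, using Python's re module as an assumed stand-in (lookahead syntax is the same there). The sample text is invented.

```python
import re

text = "Bill Gates, Bill Clinton and Billy the kid"

# Negative lookahead: Bill, but not when " Gates" follows.
not_gates = re.compile(r"Bill(?! Gates)")
negative_starts = [m.start() for m in not_gates.finditer(text)]

# Positive lookahead: Bill, only when " Gates" follows.
only_gates = re.compile(r"Bill(?= Gates)")
positive = only_gates.search(text)

print(negative_starts)    # start offsets of the two accepted matches
print(positive.group(0))  # → Bill
print(positive.groups())  # → ()
```

The matched text is just Bill in every case: the assertion checks what follows without consuming or capturing it.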

See also Assertions

Characters with a special meaning inside patterns

The following characters have meaning inside a pattern, and must be escaped if you want to literally match them:

\ (backslash)

The escape character.

^ (caret)

Asserts the beginning of the string.

$

Asserts the end of string.

() (left and right parentheses)

Denotes sub patterns.

{} (left and right curly braces)

Denotes numeric quantifiers.

[] (left and right square brackets)

Denotes character classes.

| (vertical bar)

Logical OR. Separates alternatives.

+ (plus sign)

Quantifier, 1 or more.

* (asterisk)

Quantifier, 0 or more.

? (question mark)

An optional character. Can be interpreted as a quantifier, 0 or 1.
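To see a few of these characters in action, here is a short sketch using Python's re module (an assumption for illustration; these particular characters behave the same way there).

```python
import re

# ? makes the preceding character optional.
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# * allows zero or more, + requires at least one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")

# ^ and $ anchor the pattern to the beginning and end of the string.
whole = re.search(r"^\d+$", "2024")
partial = re.search(r"^\d+$", "year 2024")

# Escaped, $ is an ordinary character.
price = re.search(r"\$\d+", "cost: $15").group(0)

print(optional)  # → [True, True, False]
print(star)      # → ['a', 'ab', 'abbb']
print(plus)      # → ['ab', 'abbb']
print(price)     # → $15
```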