Chapter 2. What is a Regular Expression

A regular expression is a way to specify the conditions a piece of text must fulfill to count as a match. Normally when you search in a text editor you specify the text to search for literally. Using a regular expression, on the other hand, you describe what a match would look like. Examples of this include: I'm searching for the word KDE, but only at the beginning of the line; or I'm searching for the word the, but it must stand on its own; or I'm searching for files starting with the word test, followed by a number of digits, for example test12, test107 and test007.

You build regular expressions from smaller regular expressions, just like you build large Lego toys from smaller subparts. As in the Lego world, there are a number of basic building blocks. In the following I will describe each of these basic building blocks using a number of examples.

Example 2.1. Searching for normal text.

If you just want to search for a given text, then a regular expression is definitely not a good choice. The reason for this is that regular expressions assign special meaning to some characters. These include the following characters: .*|$. Thus if you want to search for the text kde. (i.e. the characters kde followed by a period), then you would need to specify this as kde\. [1] Writing \. rather than just . is called escaping.
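As a minimal sketch of escaping outside the editor, here is the kde. example in Python. The choice of Python and its re module is my assumption, not something this manual prescribes, and the sample sentence is made up for illustration.

```python
import re

# Made-up sample text: one real "kde." and one typo "kdeX".
text = "I use kde. My friend typed kdeX by mistake."

# Unescaped: the dot matches any single character, so "kdeX" matches too.
unescaped = re.findall(r"kde.", text)

# Escaped: \. matches only a literal period.
escaped = re.findall(r"kde\.", text)

# re.escape adds the backslashes for you.
auto_escaped = re.escape("kde.")

print(unescaped)     # ['kde.', 'kdeX']
print(escaped)       # ['kde.']
print(auto_escaped)  # kde\.
```

The helper re.escape plays roughly the same role as the editor in footnote [1]: you give it literal text and it takes care of the escape rules.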

Example 2.2. Matching URLs

When you select something looking like a URL in KDE, then the program klipper will offer to start konqueror with the selected URL.

Klipper does this by matching the selection against several different regular expressions. When one of the regular expressions matches, the corresponding command is offered.

The regular expression for URLs says (among other things), that the selection must start with the text http://. This is described using regular expressions by prefixing the text http:// with a hat (the ^ character).

The above is an example of matching positions using regular expressions. Similarly, the position end-of-line can be matched using the character $ (i.e. a dollar sign).
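The two position characters can be tried out with a short Python sketch. The patterns below are simplified illustrations of my own, not the regular expressions Klipper actually uses, and the sample selections are invented.

```python
import re

# Invented sample selections.
selections = [
    "http://www.kde.org",
    "see http://www.kde.org",
    "http://www.kde.org/index.html",
]

# ^ ties the match to the start of the text.
starts_with_http = [s for s in selections if re.search(r"^http://", s)]

# $ ties the match to the end of the text.
ends_with_org = [s for s in selections if re.search(r"\.org$", s)]

print(starts_with_http)
print(ends_with_org)
```

The second selection contains http:// but does not start with it, so the hat rejects it. The third selection contains .org but does not end with it, so the dollar sign rejects it.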

Example 2.3. Searching for the word the, but not there, breathe or another

Two extra position types can be matched in the above way, namely the position at a word boundary, and the position at a non-word boundary. The positions are specified using the text \b (for word boundary) and \B (for non-word boundary).

Thus, searching for the word the can be done using the regular expression \bthe\b. This specifies that we are searching for the with no letters or digits directly on either side of it (i.e. with a word boundary on each side).
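To see the difference the boundaries make, here is a small Python sketch using the words from the title of this example. The sample sentence is made up, and Python's re module is assumed only for illustration.

```python
import re

# Made-up sentence containing the, there, breathe and another.
text = "the cat sat there, trying to breathe near another tree"

# Plain search: finds "the" inside the other words as well.
anywhere = re.findall(r"the", text)

# \b on each side: only the free-standing word matches.
whole_word = re.findall(r"\bthe\b", text)

# \B on each side: only "the" buried inside a word (here: another).
inside = re.findall(r"\Bthe\B", text)

print(len(anywhere))    # 4
print(len(whole_word))  # 1
print(len(inside))      # 1
```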

The four position-matching regular expressions are inserted in the regular expression editor using four different position tools.

Example 2.4. Searching for either this or that

Imagine that you want to run through your document searching for either the word this or the word that. With a normal search method you could do this in two sweeps: the first time around you would search for this, and the second time around you would search for that.

Using regular expression searches you would search for both in the same sweep. You do this by searching for this|that, i.e. separating the two words with a vertical bar.[2]
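The single sweep can be sketched in Python as follows. The sample sentences are invented, and the second pattern is my own illustration of the point in footnote [2] that each side of the bar is itself a regular expression.

```python
import re

# Invented sample text.
text = "I want this one, not that one or the other."

# One sweep finds both words, in the order they occur.
matches = re.findall(r"this|that", text)

# Each side of the bar may be a full regular expression,
# here each word wrapped in word boundaries.
bounded = re.findall(r"\bthis\b|\bthat\b", "thistle and that")

print(matches)  # ['this', 'that']
print(bounded)  # ['that']
```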

In the regular expression editor you do not write the vertical bar yourself, but instead select the alternative tool, and insert the smaller regular expressions above each other.

Example 2.5. Matching anything

Regular expressions are often compared to wildcard matching in the shell - that is the capability to specify a number of files using the asterisk. You will most likely recognize wildcard matching from the following examples:

  • ls *.txt - here *.txt is the shell wildcard matching every file ending with the .txt extension.

  • cat test??.res - matching every file starting with test followed by two arbitrary characters, and finally followed by the text .res

In the shell the asterisk matches any character any number of times. In other words, the asterisk matches anything. In regular expression syntax this is written .*. The dot matches any single character, i.e. just one character, and the asterisk says that the regular expression prior to it should be matched any number of times, including zero times. Together this says any single character any number of times.
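As a rough sketch, the shell wildcard *.txt from the list above can be written as the regular expression .*\.txt. The file names below are made up, and Python's re.fullmatch is used here on the assumption that the whole name must match, as it must in the shell.

```python
import re

# Made-up file names.
names = ["notes.txt", "report.txt", "image.png", "notes.txt.bak"]

# .* is "any single character, any number of times"; \. is a literal dot.
txt_files = [n for n in names if re.fullmatch(r".*\.txt", n)]

# "Any number of times" includes zero: .* also matches the empty text.
matches_empty = re.fullmatch(r".*", "") is not None

print(txt_files)
print(matches_empty)
```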

This may seem overly complicated, but when you get the larger picture you will see the power. Let me show you another basic regular expression: a. The letter a on its own is a regular expression that matches a single letter, namely the letter a. If we combine this with the asterisk, i.e. a*, then we have a regular expression matching any number of a's.

We can combine several regular expressions after each other, for example ba(na)*. [3] Imagine you had typed this regular expression into the search field in a text editor, then you would have found the following words (among others): ba, bana, banana, bananananananana.
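Here is a small Python sketch of ba(na)* at work. The extra words ban and bandana are my own additions to show what does not match as a whole word.

```python
import re

pattern = re.compile(r"ba(na)*")

# The first four words come from the text above; the last two are
# made-up counterexamples.
words = ["ba", "bana", "banana", "bananananananana", "ban", "bandana"]

# fullmatch requires the entire word to fit the pattern.
whole = [w for w in words if pattern.fullmatch(w)]

# A search inside "ban" still succeeds, but it only finds the "ba" part.
partial = pattern.search("ban").group()

print(whole)
print(partial)  # ba
```

Note that a search in a text editor behaves like the second case: it reports the part of the text that fits the pattern, even when the surrounding word as a whole does not.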

Given the information above, it hopefully isn't hard for you to write the shell wildcard test??.res as a regular expression. Answer: test..\.res. The dot on its own is any character. To match a single dot you must write \. [4]. In other words, the regular expression \. matches a dot, while a dot on its own matches any character.
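You can check the answer with a short Python sketch that compares the shell-style wildcard with the regular expression. The file names are invented, and fnmatchcase from Python's standard fnmatch module stands in for the shell here, which is an assumption on my part.

```python
import re
from fnmatch import fnmatchcase

# Invented file names.
names = ["test12.res", "testAB.res", "test1.res", "test123.res", "test12xres"]

# Shell-style wildcard: ? is exactly one arbitrary character.
by_wildcard = [n for n in names if fnmatchcase(n, "test??.res")]

# Regular expression: . is one arbitrary character, \. is a literal dot.
by_regex = [n for n in names if re.fullmatch(r"test..\.res", n)]

# Forgetting the backslash lets any character stand in for the dot.
unescaped = [n for n in names if re.fullmatch(r"test...res", n)]

print(by_wildcard)
print(by_regex)
print(unescaped)
```

The first two lists are the same. The third also contains test12xres, which is exactly the mistake that escaping prevents.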

In the regular expression editor, a repeated regular expression is created using the repeat tool.

Example 2.6. Replacing & with &amp; in an HTML document

In HTML the special character & must be written as &amp; - this is similar to escaping in regular expressions.

Imagine that you have written an HTML document in a normal editor (e.g. XEmacs or Kate), and you totally forgot about this rule. What you would do when you realized your mistake was to replace every occurrence of & with &amp;.

This can easily be done using normal search and replace. There is, however, one glitch. Imagine that you did remember this rule - just a bit - and did it right in some places. Replacing unconditionally would result in &amp; being replaced with &amp;amp;.

What you really want to say is that & should only be replaced if it is not followed by the letters amp;. You can do this with a regular expression that uses negative lookahead.

The regular expression, which only matches an ampersand if it is not followed by the letters amp; looks as follows: &(?!amp;). This is, of course, easier to read using the regular expression editor, where you would use the lookahead tools.
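The replacement can be sketched in Python like this. The sample HTML text is made up, and the sketch only protects the one entity discussed here; other entities such as &lt; would still have their ampersand replaced.

```python
import re

# Made-up text where the rule was followed in one place only.
html = "Tom & Jerry &amp; friends & more"

# Unconditional replace: the already correct &amp; gets mangled.
naive = html.replace("&", "&amp;")

# Lookahead: replace & only when it is not followed by amp;
fixed = re.sub(r"&(?!amp;)", "&amp;", html)

print(naive)  # Tom &amp; Jerry &amp;amp; friends &amp; more
print(fixed)  # Tom &amp; Jerry &amp; friends &amp; more
```

The lookahead only inspects the text after the ampersand. It is not part of the match, so the letters amp; are never touched by the replacement.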

[1] The regular expression editor solves this problem by taking care of escape rules for you.

[2] Note that on each side of the vertical bar is a regular expression, so this feature is not only for searching for two different pieces of text, but for searching for two different regular expressions.

[3] (na)* just says that what is inside the parentheses is repeated any number of times.

[4] This is called escaping.