Regular Expressions - Class 101
NOTES: Before beginning, you might want to read about testing regular expressions. If you can't see the fixed-width fonts clearly, your browser will let you change the font size. To change the default size permanently, set the fixed font to a larger size in your browser's preferences or internet options.
Welcome Class 100 graduates! I see a few hearty souls are ready to press on.
The regex * means "zero or more occurrences." The star is a suffix operator, meaning it follows the expression it applies to. Another way of looking at this is that the star makes the preceding expression OPTIONAL, and also repeatable.
Subject contains reg. expr. "co*at"
This will match "coat" OR "cat." Note that it would also match "coooooooooat" because the "o" may occur zero or MORE times, with no upper limit.
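If you want to try the star outside your newsreader, here is a minimal sketch in Python. It assumes that "Subject contains reg. expr." behaves like an unanchored, case-insensitive search, which I model with re.search and re.IGNORECASE; the sample subjects are made up for illustration.

```python
import re

# Assumption: the newsreader rule is an unanchored, case-insensitive search.
pattern = re.compile(r"co*at", re.IGNORECASE)

# Hypothetical subjects, not taken from any real newsgroup.
subjects = ["cat", "coat", "coooooooooat", "cot", "my coat is wet"]

# Keep only the subjects in which the pattern is found somewhere.
matches = [s for s in subjects if pattern.search(s)]
print(matches)
```

"cot" is the only subject left out: the "o" is optional, but the "a" and "t" are not.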
Parentheses can group expressions like this:
Subject contains reg. expr. "(co)*at"
Now the whole phrase "co" is optional and may occur zero or more times. A common use is the following form, which allows you to find a string within a thread, including replies.
Subject contains reg. expr. "<0>(re: )*cat"
We haven't gone into the <0> yet, but it means "occurring at the beginning of the line"; more on that later. The significant part is that the (re: ) is optional because of the star. This will match a subject starting with "cat" and also any replies like "Re: cat."
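You can match lists, like zero or more digits.
Subject contains reg. expr. "[0-9]*"
Or zero or more upper or lower case letters.
Subject contains reg. expr. "[a-zA-Z]*"
Because zero occurrences also count, either expression on its own matches every subject, so they are only useful as part of a larger expression.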
Let's look at one of the examples from my examples page. We will use the star to match 'Tom' or 'Thomas.' The star indicates that the "h" and "as" are permitted but not necessary.
Subject contains reg. expr. "th*om(as)*"
Subject contains reg. expr. "th*oma*s*"
Since we are looking for a name, we could specify a capital T like the following.
Subject contains reg. expr. "[T]h*om(as)*"
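To see exactly which part of a subject the two Tom/Thomas expressions grab, the sketch below prints the matched text. It is only an illustration in Python: the helper name matched_text and the sample subjects are my own, and I am assuming the rule works as an unanchored, case-insensitive search.

```python
import re

# Assumption: unanchored, case-insensitive search, as with a "contains" rule.
grouped = re.compile(r"th*om(as)*", re.IGNORECASE)
loose = re.compile(r"th*oma*s*", re.IGNORECASE)

def matched_text(pattern, subject):
    """Return the text the pattern matched, or None if it did not match."""
    m = pattern.search(subject)
    return m.group(0) if m else None

# Hypothetical subjects chosen to show both hits and surprises.
for subject in ["Tom", "Thomas", "Toms", "Thhhomasas", "customer", "Tim"]:
    print(subject, "|", matched_text(grouped, subject), "|", matched_text(loose, subject))
```

Note that "customer" is a hit for both expressions because it contains "tom". The star is generous, so it pays to test against subjects you do NOT want to match.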
Very useful effects can be obtained with the wild card (period).
Subject contains reg. expr. "z.*z"
This will match a line containing at least two z characters anywhere in the line. They can be contiguous or not. This means "match a 'z' followed by anything or nothing, occurring zero or any number of times, followed by another 'z'."
You can use this method to find two phrases or words separated by anything (or nothing).
Subject contains reg. expr. "Toms.*page"
This would match "Toms Page," "Toms verbose page," or "Tomspage."
The plus sign, or +, is almost identical in function to the star, *, with one critical difference: the + means match one or more occurrences of the preceding expression. There must be at least one occurrence present for the match. The expression could be a single one like a number or character, or a combination grouped with parentheses. Therefore, to find multi-part binary articles we could use the following.
Subject contains reg. expr. "[0-9]+/[0-9]+"
The above expression means find subject lines containing one or more of the digits 0 through 9, followed by a forward slash, followed by one or more of the digits 0 through 9. In case you don't recognize it, this expression is commonly used to identify multi-part binaries.
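Here is a rough Python sketch of the multi-part test. The function name is_multipart and the sample subjects are hypothetical, and re.search stands in for the newsreader's "contains" rule, which is an assumption on my part.

```python
import re

# One or more digits, a forward slash, one or more digits.
multipart = re.compile(r"[0-9]+/[0-9]+")

def is_multipart(subject):
    """True if the subject contains a digits/digits part counter."""
    return multipart.search(subject) is not None

print(is_multipart("holiday.jpg (03/15)"))  # digits on both sides of the slash
print(is_multipart("photo /15"))            # no digit before the slash
print(is_multipart("photo 3/"))             # no digit after the slash
print(is_multipart("open 24/7 all year"))   # not a binary, but still a hit
```

The last subject shows the limit of the rule: anything that looks like number/number is a hit, whether or not it is a part counter.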
Let's look at another example from my regex examples page.
Subject contains reg. expr. "tal+ah+as+ee"
This construction catches misspellings of Tallahassee by matching one or more of each of the letters l, h and s.
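A quick way to check which misspellings the Tallahassee expression really catches is to run it over a few variants. This is a sketch in Python under my usual assumption of a case-insensitive "contains" search; the list of spellings is invented for the test.

```python
import re

# Assumption: case-insensitive, unanchored search.
tally = re.compile(r"tal+ah+as+ee", re.IGNORECASE)

# Hypothetical spellings: doubled, single, and wrong letters.
spellings = ["Tallahassee", "Talahasee", "Talllahhassee", "Tallahasse", "Talahassie"]

caught = [s for s in spellings if tally.search(s)]
missed = [s for s in spellings if not tally.search(s)]
print("caught:", caught)
print("missed:", missed)
```

Only the l, h and s may repeat. A spelling that drops an "e" or swaps a vowel is missed, because those letters are still required exactly as written.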
Remember you can group expressions with parentheses and the operator will apply to the group. You can use other operators within the parentheses such as the following.
Subject contains reg. expr. "(e|i)+"
This will match strings containing one or more of the letters e or i. By itself this expression is of dubious value: it would probably match every article in your group, because most subjects contain at least one e or i. It is more likely to be included within a larger expression.
As with the star, very useful effects can be obtained with the wild card (period).
Subject contains reg. expr. "z.+z"
This will match a line containing at least two z characters with at least one character between them. It means "match a 'z' followed by anything, one to any number of times, followed by another 'z'." The only difference between this construction and the star is that there must be something (e.g. a space or a character) between the expressions, so "zz" alone does not match.
You can use this method to find two phrases or words separated by anything.
Subject contains reg. expr. "Toms.+page"
This would match "Toms Page" or "Toms verbose page." It would NOT match "Tomspage."
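The difference between the wild card with a star and with a plus is easiest to see side by side. The sketch below is Python, with re.search and re.IGNORECASE standing in (by assumption) for the newsreader's rule; the variable names and subjects are mine.

```python
import re

# Assumption: case-insensitive, unanchored search.
star = re.compile(r"Toms.*page", re.IGNORECASE)
plus = re.compile(r"Toms.+page", re.IGNORECASE)

subjects = ["Toms Page", "Toms verbose page", "Tomspage", "page by Toms"]
star_hits = [s for s in subjects if star.search(s)]
plus_hits = [s for s in subjects if plus.search(s)]
print("star:", star_hits)
print("plus:", plus_hits)

# The same comparison with the z example.
zz_star = re.search(r"z.*z", "zz") is not None    # nothing between is fine
zz_plus = re.search(r"z.+z", "zz") is not None    # needs something between
zzz_plus = re.search(r"z.+z", "zzz") is not None  # the middle z is the "something"
print(zz_star, zz_plus, zzz_plus)
```

Neither form matches "page by Toms": the order of the two words matters.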
The <0> construction matches the beginning of the line. An example should clarify things.
Subject contains reg. expr. "<0>big"
This regex rule condition will match articles with subjects that begin with the string "big." The subject must start with those three letters, so a subject beginning with a longer word such as "bigger" would also match.
The following is a very useful syntax. It will match a subject beginning with the word "big" and also followup articles.
Subject contains reg. expr. "<0>(re: )*big"
This means match start of line, followed by zero or more "re: " expressions (note the space following the colon), followed by the expression big. ("big" is really made up of three regexes: b, i, and g.)
This may not be technically accurate, but you can think of the entire <0> construction as a character.
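Python has no <0>, so to experiment with the thread rule I substitute the caret, ^, which is the usual regex symbol for start of line. That substitution is my assumption, as are the case-insensitive search, the function name in_thread and the sample subjects.

```python
import re

# Assumption: the newsreader's <0> corresponds to ^ in Python's re module.
thread = re.compile(r"^(re: )*big", re.IGNORECASE)

def in_thread(subject):
    """True if the subject starts with optional 're: ' prefixes, then 'big'."""
    return thread.search(subject) is not None

print(in_thread("big news"))          # original article
print(in_thread("Re: big news"))      # first reply
print(in_thread("Re: Re: big news"))  # reply to a reply
print(in_thread("a big deal"))        # "big" is not at the start
print(in_thread("Re:big news"))       # no space after the colon
```

The last case fails only because of the missing space, which is why the space inside (re: ) is worth noticing.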
The <~0> construction is the opposite of <0>: it matches the end of the line and is placed at the end of the target expression.
Subject contains reg. expr. "code<~0>"
This will match an article subject which ends with the word "code." Note that a subject ending with "decode" would also match. A very useful trick to avoid hitting other, larger strings containing our word at end of line is to add a space before our target word. The same trick works for words at the beginning of the line, with the space added after the word. Here is an example:
Subject contains reg. expr. " code<~0>"
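The effect of the leading space can be checked with a short Python sketch. Here I assume the dollar sign, $, plays the role of <~0>, and that the rule is a case-insensitive search; the pattern names and subjects are hypothetical.

```python
import re

# Assumption: the newsreader's <~0> corresponds to $ in Python's re module.
ends_with_code = re.compile(r"code$", re.IGNORECASE)
ends_with_word = re.compile(r" code$", re.IGNORECASE)

subjects = ["secret code", "how to decode", "code", "code review"]
plain_hits = [s for s in subjects if ends_with_code.search(s)]
spaced_hits = [s for s in subjects if ends_with_word.search(s)]
print("without space:", plain_hits)
print("with space:   ", spaced_hits)
```

The space gets rid of "how to decode," but it also loses a subject that consists of the single word "code," since nothing comes before it.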
One of the things I tried to find was how to match complete words only. You may be tempted to add a space before and after the word, but that will fail when the word is at the beginning or end of the line. The best solution I have found is the following construction, using target as the desired word to match.
Subject contains reg. expr. "(<0>| +)target( +|<~0>)"
Remember I said you can think of <0> and <~0> as characters? The above expression means match the beginning of line OR one or more spaces, followed by the expression "target," followed by one or more spaces OR the end of line.
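The whole-word construction can be tried in Python by swapping in ^ for <0> and $ for <~0>. Those substitutions, the case-insensitive search, the function name has_word and the test subjects are all assumptions of mine, so treat this as a sketch and not as the newsreader's exact behavior.

```python
import re

# Assumption: <0> maps to ^ and <~0> maps to $ in Python's re module.
whole_word = re.compile(r"(^| +)target( +|$)", re.IGNORECASE)

def has_word(subject):
    """True if 'target' appears bounded by spaces or the line ends."""
    return whole_word.search(subject) is not None

for subject in ["target", "Target practice", "on target",
                "hit the target today", "targets", "retarget",
                "target, acquired"]:
    print(subject, "->", has_word(subject))
```

The word is found alone, at the start, at the end and in the middle, while "targets" and "retarget" are rejected. The last subject is rejected too: a comma is neither a space nor the end of line, so punctuation next to the word defeats this construction.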