Regular Expressions - Class 100


Contents

  • Strings and Substrings
  • Compare Rule Conditions and Regexs
  • The Regex Wild Card
  • Metacharacters
  • The Regex "OR"
  • Regex "OR" with Characters
  • Brackets [ ]
  • Brackets and Lists
  • Brackets and List Ranges
  • Brackets and Multiple List Ranges
  • Negation within Brackets

NOTES: Before beginning, you might want to read about testing regular expressions. If you can't read the fixed-width fonts clearly, enlarge the text in your browser. To change the default size permanently, set a larger fixed font in your browser's preferences or internet options.

If you ended up here from an outside link be aware that this regular expression discussion is specific to Gravity. While some examples will work with Perl compatible expressions, others will not.


Strings and Substrings

Programmers won't need to be told this but a word (which is a text string) consists of a group of characters added (concatenated) together.

Look at the following Usenet subject string ...

   Re: Cat and Dog

It is composed of these characters ...

Capital R, lower case e, colon, space

Stop here .. note that spaces are characters just like any other character. They must be accounted for.

This discussion may seem silly at this point but it is important to think this way to be able to target strings. You must think character by character by character. Each character is in itself a regular expression.

Within the subject string there are substrings. In our example above, a substring could be "Cat". It could just as easily be "t a" (see it at the end of Cat? It is "t" followed by a space followed by "a"). Not to belabor the point, but a substring could also be "nd D". In practice, you probably would not look for such small, meaningless strings; they are presented for example purposes.

Numbers in a subject such as (5/300) are not treated as numeric values but just like any other string. In other words, this example consists of the following characters: an opening parenthesis, followed by the character 5, followed by a forward slash, followed by the character 3 (not the number 300), followed by the character 0, and so on.
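
If you want to see the character-by-character view for yourself, here is a small sketch in Python. Python is not part of Gravity and the variable names are made up for the example; it only illustrates how a subject breaks down into characters and substrings.

```python
# Illustrative sketch (Python, not Gravity): a subject is a sequence of characters.
subject = "Re: Cat and Dog"

# Break the subject into individual characters; the space counts too.
chars = list(subject)
first_four = chars[:4]           # capital R, lower case e, colon, space

# A substring is any run of consecutive characters.
has_cat = "Cat" in subject       # the word itself
has_t_a = "t a" in subject       # "t", space, "a"
has_nd_d = "nd D" in subject     # "n", "d", space, "D"

# Digits in a subject are characters, not numeric values.
part_chars = list("(5/300)")

print(first_four)
print(has_cat, has_t_a, has_nd_d)
print(part_chars)
```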

Gravity's "normal" rule conditions find substrings without regard to case. Regular expressions allow us to specify case, match patterns or place other conditions on our target substring.


Compare Rule Conditions and Regexs

Let's suppose you want to find articles containing "cat" in the subject. You can use a rule condition.
   Subject contains "cat"
Or you could use a regular expression.
   Subject contains reg. expr. "cat"

Is there any difference?

No, not in this case. Both of these constructions will match the same article subjects. In Gravity, a regular expression outside of brackets [ ] is not case sensitive. So, in a simple case like this, where you are not using case sensitivity, wild cards or other specifications, regular expressions are not really needed.
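
The same comparison can be sketched in Python. This is not how Gravity works internally; the subject list is invented, and the re.IGNORECASE flag is an assumption added to mimic Gravity's behavior, because Python regexes are case sensitive by default.

```python
import re

# Made-up subjects for the illustration.
subjects = ["My cat is ill", "CAT food", "Concatenate strings", "Dogs only"]

# Plain rule condition: a substring test that ignores case.
plain_hits = [s for s in subjects if "cat" in s.lower()]

# Regex with no special characters. re.IGNORECASE is an assumption
# made here to imitate Gravity's case handling outside brackets.
regex_hits = [s for s in subjects if re.search("cat", s, re.IGNORECASE)]

print(plain_hits == regex_hits)
print(regex_hits)
```

Both lists contain the same subjects, including "Concatenate strings", because neither approach cares about whole words.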

What if you want to find words like "cat" "cot" "cut"?


The Regex Wild Card

The dot "." acts as a wild card for one character.
   Subject contains reg. expr. "c.t"
Will match cat, cut, cot or, for that matter, cft or c t (that was a space) because the wild card is just that: it will match any character, including letters, spaces, punctuation or numbers. Let's take it further.
   Subject contains reg. expr. "c..t"

Will match a c, followed by any 2 characters (including spaces), followed by a t. This would match colt or coat. You can also write it like this with the replication operator {}.

   Subject contains reg. expr. "c.{2}t"

The wild card can be extremely useful when combined with the * or +. You will learn this in Class 101.
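
The three wild card patterns above can be tried out in Python. Treat this as a sketch only: Python's regex engine is not Gravity's, and the word list is invented. The dot and {2} behave the same way in both for these simple cases.

```python
import re

# Made-up test words; "c t" contains a space.
words = ["cat", "cut", "cot", "cft", "c t", "ct", "colt", "coat"]

# One wild card: c, any single character, t.
one_wild = [w for w in words if re.search("c.t", w)]

# Two wild cards, written out and with the replication operator.
two_wild = [w for w in words if re.search("c..t", w)]
counted = [w for w in words if re.search("c.{2}t", w)]

print(one_wild)
print(two_wild)
print(two_wild == counted)
```

Note that "ct" matches none of the patterns, because each dot must consume exactly one character.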


Metacharacters

It would be negligent at this point not to mention the "metacharacters." These are simply characters that have a special meaning when used within a regular expression. Remember the regex wild card, the . (period)? If you want to find a period (as in a file extension) and enter a plain dot, it acts as a wild card. To make a regex find a literal dot rather than any character, the dot must be escaped with a backslash. To find subjects containing the file extension .jpg, use something like this.
   Subject contains reg. expr. "\.jpg"

Many metacharacters lose their special status when enclosed in brackets. I will say more about metacharacters later.
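
Here is a short Python sketch of the difference between an unescaped and an escaped dot. The names list is invented, and Python stands in for Gravity only as an illustration. The last pattern puts the dot in brackets, which in Python also makes it a literal dot.

```python
import re

# Made-up strings: only the first contains a real ".jpg".
names = ["photo.jpg", "my jpgs", "ajpg file"]

# Unescaped dot: any character followed by "jpg".
unescaped = [n for n in names if re.search(".jpg", n)]

# Escaped dot: a literal period followed by "jpg".
escaped = [n for n in names if re.search(r"\.jpg", n)]

# Dot inside brackets: it loses its wild card meaning.
bracketed = [n for n in names if re.search("[.]jpg", n)]

print(unescaped)
print(escaped)
print(bracketed)
```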


The Regex "OR"

The pipe symbol ( | , also called the vertical bar) acts as OR.

To look for "cat" or "dog" with regexs you can enter ..

   Subject contains reg. expr. "cat|dog"

It may be more correct (but not necessary) to include parentheses like the following.

   Subject contains reg. expr. "(cat)|(dog)"

The equivalent rule without regex is ..

  Subject contains "cat"
  OR
  Subject contains "dog"

So, what's the advantage? Not much in this example. But the regex OR is useful with large expressions and also with single characters. Instead of writing this without regexes:

  Subject contains "moe"
  OR
  Subject contains "larry"
  OR
  Subject contains "curley"

We can write the following much more attractive regex:

  Subject contains reg. expr. "moe|larry|curley"
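
For the curious, the same OR can be sketched in Python. The subjects are invented, and re.IGNORECASE is an assumption added to imitate Gravity's case handling outside brackets.

```python
import re

# Made-up subjects for the illustration.
subjects = ["Moe strikes again", "Ask Larry", "curley fries", "Shemp who?"]

# One pattern replaces three separate "contains" conditions.
# re.IGNORECASE is an assumption to mimic Gravity outside brackets.
pattern = re.compile("moe|larry|curley", re.IGNORECASE)
hits = [s for s in subjects if pattern.search(s)]

print(hits)
```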

Regex "OR" With Characters

Ok, suppose I used the wild card c.t and found too much stuff. We will limit it to finding "cat" or "cot".
   Subject contains reg. expr. "c(a|o)t"

This regex will match a "c" followed by EITHER an "a" OR an "o" followed by a "t." NOTE: The parentheses are necessary; they are used for grouping. Writing "ca|ot" is not the same thing. It would match any string containing either "ca" or "ot", and you would get quite a few false hits.

NOTE: This construction is NOT case sensitive. You will match words like:

cat, caT, Cot, COT
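
The effect of the parentheses can be seen in a quick Python sketch. The word list is made up, and re.IGNORECASE is an assumption used to imitate Gravity's case handling outside brackets.

```python
import re

# Made-up test words.
words = ["cat", "cot", "cut", "cart", "robot", "Cot"]

# Grouped: c, then a OR o, then t.
grouped = [w for w in words if re.search("c(a|o)t", w, re.IGNORECASE)]

# Ungrouped: "ca" OR "ot" anywhere in the word, hence the false hits.
ungrouped = [w for w in words if re.search("ca|ot", w, re.IGNORECASE)]

print(grouped)
print(ungrouped)
```

Without the grouping, "cart" and "robot" sneak in because they contain "ca" and "ot".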

Now, what if we want to find capital letters ? To match case, we need to look at brackets.

(Note that case handling is different in Perl and JavaScript regular expressions. But, if you aren't going to use those languages, don't worry about it.)

Brackets

The brackets are quite cool and account for many of the bizarre-looking regexes you see. A set of brackets can contain a single character. A character in brackets is case sensitive. Example ..
   Subject contains reg. expr. "[c]at"

This will find any words containing "cat".
It will NOT match "Cat".

   Subject contains reg. expr. "[C]at"

Will match "Cat" NOT "cat." This is how regexs are used to specify case and it is very useful. You cannot do this with Gravity's text rule expressions.

Let's say you want to find (maybe to filter out) some junk which always begins with "FREE" in all capital letters.

   Subject contains reg. expr. "[F][R][E][E]"

TIP: Many metacharacters lose their special meaning and behave like other characters when placed inside brackets.
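
Python handles case differently from Gravity, so the bracket rule has to be imitated. The sketch below is one assumed translation, not Gravity's own syntax: the bracketed letter stays case sensitive, and the rest of the pattern is wrapped in (?i:...), which tells Python (3.6 or later) to ignore case for that part only.

```python
import re

# Assumed Python translation of Gravity's "[c]at" and "[C]at":
# the bracketed letter is case sensitive, (?i:at) ignores case.
lower_c = re.compile(r"[c](?i:at)")
upper_c = re.compile(r"[C](?i:at)")

# Every letter bracketed, so every letter is case sensitive.
free = re.compile(r"[F][R][E][E]")

words = ["cat", "Cat", "cAT", "CAT"]
lower_hits = [w for w in words if lower_c.search(w)]
upper_hits = [w for w in words if upper_c.search(w)]

junk = ["FREE money", "Free money", "free money"]
free_hits = [s for s in junk if free.search(s)]

print(lower_hits)
print(upper_hits)
print(free_hits)
```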


Brackets And Lists

You can put more than one character inside brackets.
   Subject contains reg. expr. "[cC]at" 

Now everything inside the brackets counts as ONE character, with an implied OR between the listed characters. This will find "Cat" OR "cat" (or cAT; remember, anything outside the brackets is not case sensitive). The astute reader probably noticed that this construction hits the same targets as a rule condition string of "cat" without regexes. It is just an example.

Let's go back to our earlier wild card example to find cat, cut, or cot. Instead of the wild card, which would match anything, we can be more specific like this:

   Subject contains reg. expr. "c[auo]t"

Remember, though, that the bracketed letter is matched in lower case only, because it is inside the brackets. To match upper or lower case use this:

   Subject contains reg. expr. "c[aAuUoO]t"

Here is a better example:

   Subject contains reg. expr. "[bch]at"

Will match "bat" "cat" or "hat."

It is important to grasp the concept that the entire bracket counts as one character. So the last expression will match three characters, beginning with one of the bracketed characters, followed by a, followed by t. (You can change this behavior and find more repetitions with various replication operators.)

Be aware that these simple examples may not be finding whole words. So words like "battle" "cattle" "redhat" would also match the above example. Many times this error level is of no concern but keep it in mind if you get unexpected results. I show how to use Gravity's regexes to match a whole word only in Class 101.
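
A short Python sketch shows both points: the bracket stands for one character, and the pattern happily matches inside longer words. The word list is invented, and Python is only standing in for Gravity here.

```python
import re

# Made-up test words.
words = ["bat", "cat", "hat", "rat", "battle", "cattle", "redhat"]

# One of b, c or h, followed by "a", followed by "t".
hits = [w for w in words if re.search("[bch]at", w)]

# The matched text is always exactly three characters long.
matched_text = re.search("[bch]at", "battle").group()

print(hits)
print(matched_text)
```

"rat" is rejected, but "battle", "cattle" and "redhat" are found because the three matching characters sit inside them.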


Brackets and List Ranges

This is way cool and extremely important. You can specify a range within a list. The range can be
numbers, 0-9
lower case letters, a-z
upper case letters, A-Z

   Subject contains reg. expr. "[0-9]"

Will match any of these SINGLE characters ... 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9

   Subject contains reg. expr. "[1-4]"

Will match 1, 2, 3, 4

   Subject contains reg. expr. "[a-d]"

Will match a, b, c, or d (lower case only).

   Subject contains reg. expr. "[A-D]"

Will match A, B, C, or D (upper case only).

   Subject contains reg. expr. "c[a-z]t"

Will match strings like cat, cbt, cct, cdt, cet, ... czt. Get it? We have a "c" followed by any lower case letter followed by a "t". Now, to see if you are getting any of this, would this construction match ..

So, to match any single upper case letter while excluding lower case letters, spaces, digits and punctuation, you would write:

   "[A-Z]"

NOTE: The ranges I listed above are only the ones most commonly used because they are obvious and easy to remember. The ranges are really contiguous ASCII character ranges. The lower case letters, the upper case letters and the digits each occupy a contiguous run of the ASCII chart. Look at any ASCII chart and you will see that you could use a range of [!-~] and match lower case, upper case, numbers and most punctuation marks! But this is beyond the scope of our 100 class.
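
The ranges can be explored with Python's re.findall, which lists every single character a bracket matches. This is a sketch with an invented test string. Keep in mind that Python is case sensitive everywhere, so the "c" outside the brackets only matches a lower case c here, unlike in Gravity.

```python
import re

# Made-up text containing letters, digits, spaces and punctuation.
text = "Re: Cat (5/300) czt"

digits = re.findall("[0-9]", text)        # every digit character
one_to_four = re.findall("[1-4]", text)   # only 1, 2, 3 or 4
lower_a_d = re.findall("[a-d]", text)     # lower case a through d
upper = re.findall("[A-Z]", text)         # any upper case letter

# In Python the leading "c" is case sensitive, so "Cat" is skipped.
c_any_t = re.findall("c[a-z]t", text)

# The wide ASCII range from "!" to "~" skips only the spaces here.
printable = re.findall("[!-~]", text)

print(digits, one_to_four, lower_a_d, upper, c_any_t)
print(len(printable))
```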


Brackets and Multiple List Ranges

You can combine ranges (and single characters) inside one set of brackets.
   Subject contains reg. expr. "[a-zA-Z]"

Will match any lower case or upper case letter.

   Subject contains reg. expr. "[a-zA-Z0-9]"

Will match any lower case letter, upper case letter or digit.

   Subject contains reg. expr. "[0-9 -]"

Will match a digit, space, or dash.

   Subject contains reg. expr. "[a-c3-6]"

Will match an a, b, c, 3, 4, 5, or 6.

   Subject contains reg. expr. "c[a-zA-Z0-9]t"

Will match ...
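
To experiment with the combined ranges, here is a Python sketch using invented test strings. It shows what Python's engine does with these brackets; Gravity's handling of the "c" and "t" outside the brackets differs in case sensitivity, as noted earlier.

```python
import re

sample = "a1 B-2"   # made-up sample: letter, digit, space, letter, dash, digit

letters = re.findall("[a-zA-Z]", sample)
alnum = re.findall("[a-zA-Z0-9]", sample)
digit_space_dash = re.findall("[0-9 -]", sample)
mixed = re.findall("[a-c3-6]", "abcd 2345678")

# c, then one letter or digit, then t.
candidates = ["cat", "cAt", "c9t", "c t", "c-t"]
c_alnum_t = [w for w in candidates if re.search("c[a-zA-Z0-9]t", w)]

print(letters, alnum, digit_space_dash)
print(mixed)
print(c_alnum_t)
```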


Negation

The caret ^ (or is it a circumflex?), when used as the first character in brackets, means match anything except what is inside the brackets.
   Subject contains reg. expr. "c[^0-9]t"

Will match a c, followed by any single character except a digit, followed by a t.
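
Negation is easy to check with a few test words. The list below is invented, and Python stands in for Gravity as an illustration only.

```python
import re

# Made-up test words; "c t" contains a space.
words = ["cat", "c t", "c-t", "c5t", "c0t", "ct"]

# c, then any one character that is NOT a digit, then t.
hits = [w for w in words if re.search("c[^0-9]t", w)]

print(hits)
```

"ct" does not match: the negated bracket still has to consume one character, it just may not be a digit.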

I'M READY FOR CLASS 101 !!!
