The Real World
Enter The Matrix
Regular expressions, also known as "regex" by the geek community, are a powerful tool used in pattern-matching and substitution. They are commonly associated with almost all *NIX-based tools, including editors like vi, scripting languages like Perl and PHP, and shell programs like awk and sed.
A regular expression lets you build patterns using a set of special characters; these patterns can then be compared with text in a file, data entered into an application, or input from a form filled in by users on a Web site. Depending on whether or not there's a match, appropriate action can be taken, and appropriate program code executed.
For example, one of the most common applications of regular expressions is to check whether or not a user's email address, as entered into an online form, is in the correct format; if it is, the form is processed, whereas if it's not, a warning message pops up asking the user to correct the error. Regular expressions thus play an important role in the decision-making routines of Web applications - although, as you'll see, they can also be used to great effect in complex find-and-replace operations.
A regular expression usually looks something like this:
All this does is match the pattern "matrix" in the text it's applied to. Like many other things in life, it's simpler to get your mind around the pattern than the concept - but then, that's neither here nor there...
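As a rough sketch of the pattern just described, written as a JavaScript regular expression literal: the sample strings are my own, and the test() method used here to check for a match is one I'm assuming you'll accept on faith for now.

```javascript
// Hypothetical sample strings; /matrix/ is the literal pattern described above.
const literalPattern = /matrix/;

// test() reports whether the pattern occurs anywhere in the string.
const foundLower = literalPattern.test("welcome to the matrix");
const foundMixed = literalPattern.test("Welcome to The Matrix");

console.log(foundLower); // true
console.log(foundMixed); // false, because matching is case-sensitive by default
```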
How about something a little more complex? Try this:
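This would match the words "matting" and "mattress". Why? Because the "+" character is used to match one or more occurrences of the preceding character - for example, the characters "ma" followed by one or more occurrences of the letter "t". Note that a pattern read this way also matches "matrix", which has a single "t" after "ma"; to rule it out, the pattern must demand at least two, that is, "mat" followed by one or more occurrences of "t".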
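Here's a small sketch of how the "+" meta-character behaves. Both patterns are assumptions on my part, chosen to fit the description, and "mall" is an extra word thrown in for contrast.

```javascript
// Assumed patterns, chosen to illustrate the "+" meta-character.
const oneOrMoreT = /mat+/;  // "ma" followed by one or more "t"
const twoOrMoreT = /matt+/; // "mat" followed by one or more "t" (at least two in all)

const plusWords = ["matting", "mattress", "matrix", "mall"];
const plusResults = plusWords.map((word) => ({
  word,
  oneOrMoreT: oneOrMoreT.test(word),
  twoOrMoreT: twoOrMoreT.test(word),
}));

// "matrix" satisfies /mat+/ (one "t" is enough) but not /matt+/.
console.log(plusResults);
```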
Similar to the "+" meta-character (that's the official term), we have "*" and "?" - these are used to match zero or more occurrences of the preceding character, and zero or one occurrence of the preceding character, respectively. So,
would match "easy", "egocentric" and "egg"
would match "Winnie", "Wimpy", "Wilson" and "William", though not "Wendy" or "Wolf".
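To make the two meta-characters concrete, here's a sketch using patterns that fit the word lists above. The patterns /eg*/ and /Wil?/ are my assumption about what would produce those results, not necessarily the original listings.

```javascript
// Assumed patterns that are consistent with the word lists above.
const zeroOrMoreG = /eg*/; // "e" followed by zero or more "g"
const optionalL = /Wil?/;  // "Wi" followed by zero or one "l"

const starMatches = ["easy", "egocentric", "egg"].filter((word) =>
  zeroOrMoreG.test(word)
);
const questionMatches = [
  "Winnie", "Wimpy", "Wilson", "William", "Wendy", "Wolf",
].filter((word) => optionalL.test(word));

console.log(starMatches);     // all three words contain an "e"
console.log(questionMatches); // the four words containing "Wi"
```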
In case all this seems a little too imprecise, you can also specify a range for the number of matches. For example, the regular expression
would match "jimmy" and "jimmmmmy!", but not "jim". The numbers in the curly braces represent the lower and upper values of the range to match; you can leave out the upper limit for an open-ended range match.
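A sketch of the range syntax follows. The pattern /jim{2,6}/ is an assumption that fits the words mentioned, and the open-ended variant simply drops the upper limit.

```javascript
// Assumed patterns illustrating a bounded and an open-ended range.
const boundedM = /jim{2,6}/;  // "ji" followed by two to six "m"
const openEndedM = /jim{2,}/; // "ji" followed by two or more "m"

const rangeResults = {
  jimmy: boundedM.test("jimmy"),         // two "m"s: within range
  longJimmy: boundedM.test("jimmmmmy!"), // five "m"s: within range
  jim: boundedM.test("jim"),             // one "m": below the lower limit
  openEnded: openEndedM.test("jimmmmmmmmmmy"),
};

console.log(rangeResults);
```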
Two To Tango
When you run this script, you should see the following:
Sorry, Trinity is not in The Matrix.
The search() method returns the position of the substring matching the regular expression, or -1 if no match exists. In the example above, it is clear that the pattern "trinity" does not exist in the string "The Matrix"; hence, the error message.
Now, look what happens when I update the regular expression so that it results in a positive match:
Trinity located in The Matrix at character 7
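For reference, here's a minimal sketch of search() in both situations. The helper function describeSearch(), the sample strings and the message wording are hypothetical, and I've used console.log() rather than a browser alert so the code also runs outside a browser. Keep in mind that the position returned by search() is counted from zero.

```javascript
// Hypothetical helper: reports where (or whether) a pattern occurs in a string.
function describeSearch(pattern, text) {
  const position = text.search(pattern); // zero-based index, or -1 if no match
  if (position === -1) {
    return 'Sorry, no match in "' + text + '".';
  }
  return 'Match found in "' + text + '" at character ' + position;
}

const missMessage = describeSearch(/trinity/, "The Matrix");
const hitMessage = describeSearch(/Trinity/, "Neo meets Trinity");

console.log(missMessage); // Sorry, no match in "The Matrix".
console.log(hitMessage);  // Match found in "Neo meets Trinity" at character 10
```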
Game, Set, Match
The String object also comes with a match() method, which can be considered a close cousin of the search() method above. What's the difference? Well, you've already seen that the search() method returns the position where a match was found. The match() method does things a little differently - it applies a regex pattern to a string and returns the values matched in an array.
Confused? Take a look at the next example
View this example in a browser, and you'll get an alert message displaying the first matching result, as shown below:
Match #1: iss
In the example above, I have defined a regular expression "is.". This will match the string "is", followed by any one other character (the "." metacharacter at the end of the pattern matches any single character except a line break). If you look at the string to be searched, you'll see that there are two occurrences of this pattern. However, the code above returns only one. Why?
The answer is simple - I've "forgotten" to add the "g" (for "global") modifier to the pattern. As a result, searching stops after the first match. Consider the next example, which revises the previous code listing to add this modifier:
And now, when you try out this example, you should see two alert boxes, indicating that two matches to the specified pattern were found in the string. The additional "g" modifier ensures that all occurrences of a pattern in a string are matched, and stored in the return array. I'll show you a few other useful modifiers as we proceed through this tutorial.
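Here's a sketch of the difference the "g" modifier makes to match(). The string "Mississippi" is my assumption (it is consistent with the "iss" output shown earlier), and console.log() stands in for the alert boxes.

```javascript
// Assumed sample string; it contains the pattern "is." twice.
const sampleText = "Mississippi";

// Without "g": only the first match, plus details such as its index.
const firstOnly = sampleText.match(/is./);

// With "g": an array holding every match.
const allMatches = sampleText.match(/is./g);

console.log(firstOnly[0], firstOnly.index); // iss 1
console.log(allMatches);                    // [ 'iss', 'iss' ]
```

Note that match() returns null rather than an empty array when nothing matches, so it's worth checking the result before indexing into it.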
Search and Destroy
The previous set of examples highlighted the search capabilities of the String object. But that's not all - you can also perform a search-and-replace operation with the replace() method, which accepts both a regular expression and the value to replace it with. Here's how:
If you load this example in a browser, you will see that the string "Anderson" has been replaced with the string "Smith". The following output illustrates:
Welcome to the Matrix, Mr. Smith
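A minimal sketch of the replacement described above might look like this. The input string is assumed from the output shown, and the variable names are my own.

```javascript
// Assumed input, inferred from the output shown above.
const greeting = "Welcome to the Matrix, Mr. Anderson";

// replace() returns a new string; the original is left untouched.
const renamed = greeting.replace(/Anderson/, "Smith");

console.log(renamed);  // Welcome to the Matrix, Mr. Smith
console.log(greeting); // Welcome to the Matrix, Mr. Anderson
```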
Remember how I used the "g" modifier to search for multiple instances of a pattern within a string? Take it one step further - you can even use it to replace multiple instances of a pattern within the string:
Here, the \s metacharacter matches the space after "yo" and after "ho", and each match is replaced with "oo".
You can also use case-insensitive pattern matching - simply add the "i" modifier (for "insensitive") at the end of the pattern. The next example shows you how:
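The sketch below shows both modifiers at work with replace(). The strings and replacement values are hypothetical, picked only to make the effect of each modifier easy to see.

```javascript
// Hypothetical sample strings for the "g" and "i" modifiers.
const chant = "yo ho ho";

const firstSpaceOnly = chant.replace(/\s/, "-"); // no "g": first space only
const everySpace = chant.replace(/\s/g, "-");    // "g": every space

const shouted = "Mr. ANDERSON, welcome back";
const caseBlind = shouted.replace(/anderson/i, "Smith"); // "i": ignore case

// The modifiers can be combined.
const everyCase = "Ho ho HO".replace(/ho/gi, "hey");

console.log(firstSpaceOnly); // yo-ho ho
console.log(everySpace);     // yo-ho-ho
console.log(caseBlind);      // Mr. Smith, welcome back
console.log(everyCase);      // hey hey hey
```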
The String object also comes with a split() method, which can be used to decompose a single string into separate units on the basis of a particular separator value; these units are then placed into an array for further processing. Consider the following example, which demonstrates:
To understand this better, consider the following string, which illustrates a common problem - unequal whitespace between separated values:
Neo| Trinity |Morpheus | Smith| Tank
Here, the | character is used to separate the various names. However, the space between the various | is unequal - which means that before you can use the individual elements of the string, you will need to trim the additional space around them. Splitting by using a regular expression as the separator is an elegant solution to the problem - as you can see from the updated listing below:
The output of the call to split() above will be an array containing the names, without any leading or trailing spaces.
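As a sketch of the idea, here's the string from above split twice: once on a plain "|" separator, and once on a regular expression that also swallows any whitespace around the separator. The exact pattern is my assumption about one reasonable way to write it.

```javascript
const crew = "Neo| Trinity |Morpheus | Smith| Tank";

// Plain string separator: the stray spaces stay attached to the names.
const untrimmed = crew.split("|");

// Regex separator: optional whitespace, a literal "|", optional whitespace.
// The "|" must be escaped, since it otherwise means "or" inside a pattern.
const trimmed = crew.split(/\s*\|\s*/);

console.log(untrimmed); // [ 'Neo', ' Trinity ', 'Morpheus ', ' Smith', ' Tank' ]
console.log(trimmed);   // [ 'Neo', 'Trinity', 'Morpheus', 'Smith', 'Tank' ]
```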
Objects In The Rear-View Mirror
The RegExp object comes with three useful methods - take a look:
test() - test a string for a match to a pattern
exec() - returns an array of the matches found in the string, and also permits advanced regex manipulation
compile() - alter the regular expression associated with a RegExp object
Let's look at a simple example:
This is similar to one of the very first examples in this tutorial. However, as you can see, I've adopted a completely different approach here.
The primary difference here lies in my creation of a RegExp object for my regular expression search. This is accomplished with the "new" keyword, followed by a call to the object constructor. By definition, this constructor takes two parameters: the pattern to be searched for, and modifiers if any (I've conveniently skipped these in the example above).
Once the RegExp object has been created, the next step is to use it. Here, I've used the test() method to look for a match to the pattern. By default, this method accepts a string variable as a parameter and compares it against the pattern passed to the RegExp object constructor. If it finds a match, it returns true; if it does not, it returns false. Obviously, this is a more logical implementation than the search() feature of the String object.
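A minimal sketch of the constructor and the test() method follows. The variable names and sample string are my own, and the second object shows the optional modifiers parameter in use.

```javascript
// First parameter: the pattern, as a string. Second (optional): modifiers.
const trinityPattern = new RegExp("trinity");
const matrixPattern = new RegExp("matrix", "i");

const title = "The Matrix";

// test() returns true or false, which slots neatly into an "if" statement.
const hasTrinity = trinityPattern.test(title);
const hasMatrix = matrixPattern.test(title);

console.log(hasTrinity); // false
console.log(hasMatrix);  // true, thanks to the "i" modifier
```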
One Mississippi, Two Mississippi...
The next method I'll show you is the exec() method. The behavior of this method is similar to that of the String object's match() method. Take a look:
The exec() method returns a match to the supplied regular expression, if one exists, as an array; you can access the first element of the array to retrieve the matching substring, and the location of that substring through the array's index property.
The main difference between the match() and exec() methods lies in the parameters passed - the former requires a pattern as argument, while the latter requires the string variable to be tested.
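To illustrate, here's a sketch that runs the same pattern through exec() and match(). The pattern and string are assumed, carried over from the "Mississippi" idea in the section title.

```javascript
const execPattern = /is./;
const river = "Mississippi";

// exec() is called on the pattern and takes the string...
const execResult = execPattern.exec(river);

// ...while match() is called on the string and takes the pattern.
const matchResult = river.match(execPattern);

console.log(execResult[0], execResult.index);   // iss 1
console.log(matchResult[0], matchResult.index); // iss 1

// Like match(), exec() returns null when there is no match.
const noResult = /xyz/.exec(river);
console.log(noResult); // null
```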
However, that's not all. When the pattern carries the "g" modifier, the exec() method has the ability to continue searching within the string for the same pattern across successive calls. Let me tweak the above example to demonstrate this feature:
So what do we have here? For starters, I have used a "while" loop to call the exec() method repeatedly, until it reaches the end of the string (at which point the method returns null and the loop terminates). This is possible because every time you call exec() on a global pattern, the RegExp object records where the match ended in its lastIndex property, and continues searching from there on the next call.
Note that this depends on the "g" modifier. Without it, lastIndex is never advanced, so every call to exec() returns the first match again and the loop never terminates. This is the behavior the language specifies rather than a browser bug, so be careful to include the modifier when using this technique.
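Here's a hedged sketch of an exec() loop that does terminate. The pattern and string are assumptions, and the key detail is the "g" modifier on the pattern, which is what lets the RegExp object remember its position (in its lastIndex property) between calls.

```javascript
// The "g" modifier makes exec() resume from lastIndex on each call.
const repeatedPattern = /is./g;
const riverName = "Mississippi";

const found = [];
let hit;
while ((hit = repeatedPattern.exec(riverName)) !== null) {
  found.push({ text: hit[0], index: hit.index });
}

console.log(found);
// [ { text: 'iss', index: 1 }, { text: 'iss', index: 4 } ]

// Once exec() has returned null, lastIndex is reset to 0.
console.log(repeatedPattern.lastIndex); // 0
```

Without the "g" modifier, the same loop would find the first "iss" on every pass and never finish.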
Another interesting point to note in the example above is my definition of the RegExp object. Unlike the previous example, you will notice that I have not used the constructor or the "new" keyword to create the object; instead, I've simply assigned the pattern to a variable. Think of this as a shortcut technique for creating a new RegExp object.
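A quick sketch comparing the two ways of creating a RegExp object. The pattern is hypothetical; the one practical difference worth noticing is that backslashes must be doubled when the pattern is passed to the constructor as a string.

```javascript
// Shortcut (literal) form: the pattern sits between slashes.
const fromLiteral = /\d+/g;

// Constructor form: the pattern is a string, so "\d" is written "\\d".
const fromConstructor = new RegExp("\\d+", "g");

const samePattern = fromLiteral.source === fromConstructor.source;
const sameModifiers = fromLiteral.flags === fromConstructor.flags;

console.log(samePattern, sameModifiers);    // true true
console.log("room 101".match(fromLiteral)); // [ '101' ]
```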
Changing Things Around
You may have noticed from the previous examples that when using a RegExp object, you have to specify the regular expression at the time of constructing the object. So you might be wondering to yourself, what happens if I need to change the pattern at a later time?
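Notice the use of the compile() method to dynamically update the pattern associated with the RegExp object. Note that compile() has since been deprecated; in current JavaScript, the usual approach is to assign a newly created RegExp object to the variable instead.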
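Since compile() is deprecated in current JavaScript, here's a sketch of the same idea done by assigning a fresh RegExp object to the variable. This is a substitute technique rather than the compile() call itself, and the patterns and names are hypothetical.

```javascript
// Start with one pattern...
let currentPattern = new RegExp("trinity", "i");
const beforeChange = currentPattern.test("Neo");

// ...and swap in a different one later by creating a new RegExp object.
currentPattern = new RegExp("neo", "i");
const afterChange = currentPattern.test("Neo");

console.log(beforeChange); // false
console.log(afterChange);  // true
```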
Working with Forms
Now that you know how it all works, let's look at a practical example of how you can put this knowledge to good use. Consider the following example, which displays an HTML form asking the user for credit card and email information to complete a purchase.
You'll notice, in the example above, that I've used numerous regular expressions to verify that the data being entered into the form by the user is of the correct format. This type of client-side input validation is extremely important on the Web, to ensure that the data you receive is accurate, and in the correct format.
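To give a flavor of this kind of validation, here's a deliberately simple sketch. The function names, the patterns and the error messages are all hypothetical; the patterns check only the rough shape of the input, and real-world rules for email addresses and card numbers are considerably more involved.

```javascript
// Hypothetical check: something@something.something, with no spaces.
function isPlausibleEmail(value) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

// Hypothetical check: 13 to 19 digits once spaces and hyphens are removed.
function isPlausibleCardNumber(value) {
  const digitsOnly = value.replace(/[\s-]/g, "");
  return /^\d{13,19}$/.test(digitsOnly);
}

// Collects one message per failed field; an empty array means all is well.
function validateOrder(fields) {
  const errors = [];
  if (!isPlausibleEmail(fields.email)) {
    errors.push("Please enter a valid email address.");
  }
  if (!isPlausibleCardNumber(fields.cardNumber)) {
    errors.push("Please enter a valid credit card number.");
  }
  return errors;
}

console.log(validateOrder({ email: "neo@example.com", cardNumber: "4111-1111-1111-1111" }));
console.log(validateOrder({ email: "neo@", cardNumber: "1234" }));
```

In a form, a function like validateOrder() would typically be called from the submit handler, with the form going through only when the returned array is empty.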
Over And Out
To close this article, I developed a simple example that demonstrates the use of complex regular expressions to validate form input - a routine task in all Web-based applications. If you do this often, it makes sense for you to build a good library of regular expressions for common validations (if you already have one, send me some mail and tell me all about it).
Here are some additional URLs to help you understand the concept of regular expressions further:
Stringing Things Along, at http://www.melonfire.com/community/columns/trog/article.php?id=173
Pattern Matching and Regular Expressions, at http://www.webreference.com/js/column5/
That's all for this article. See you soon!
Note: Examples are illustrative only, and are not meant for a production environment. Melonfire provides no warranties or support for the source code described in this article. YMMV!
This article was first published on 18 Dec 2003.