Appendix 4. Perl Compatible Regular Expressions

Detailed information on PCRE (Perl Compatible Regular Expressions) can be found in the Perl documentation (see http://perldoc.perl.org/perlre.html), in the documentation for the PCRE library used by Parser (see http://www.pcre.org/man.txt), and in many other sources that also contain practical examples. The most detailed treatment of regular expressions is given in Mastering Regular Expressions by J. Friedl, O'Reilly (ISBN 1-56592-257-3).

The description given here is only a brief reference.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern "The quick brown fox" matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.
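
The idea above can be tried out with a minimal sketch. It uses Python's re module rather than Parser itself; that choice is an assumption made for illustration, and re is not PCRE, although it follows the same syntax for the constructs shown. The variable names are hypothetical.

```python
import re

subject = "The quick brown fox jumps over the lazy dog"

# A pattern made only of ordinary characters matches that exact text.
literal = re.search(r"The quick brown fox", subject)
literal_span = literal.span()  # (start, end) offsets of the match

# Meta-characters add alternatives (|) and repetitions (+).
animals = re.findall(r"fox|dog", subject)
doubled = re.search(r"o+", "foo boooo").group()  # leftmost run of "o"
```

The search proceeds from left to right, so the repetition example reports the first run of "o" it meets, not the longest one in the string.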

There are two different sets of meta-characters:
1. Those that are recognized anywhere in the pattern except within square brackets;
2. Those that are recognized within square brackets.

Outside square brackets, the meta-characters are as follows:

   \   general escape character with several uses (described in detail below)   
   ^   assert start of subject (or of line, in multiline mode)   
   $   assert end of subject (or of line, in multiline mode)   
   .   match any character except newline   
   [...]   character class definition; matches any one of the bracketed characters   
   |   meta-character "OR": joins several patterns into one set of alternative matches   
   (...)   delimit a subpattern within the overall pattern   
   ?   0 or 1 quantifier: match zero or one occurrence of the item on the left; also extends the meaning of "(" and makes a preceding quantifier minimal   
   *   0 or more quantifier: match zero or more occurrences of the item on the left   
   +   1 or more quantifier: match one or more occurrences of the item on the left   
   {min,max}   min/max quantifier: require at least min occurrences of the item on the left and allow at most max   
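
The meta-characters listed above can be illustrated with a short sketch. It is written for Python's re module, which is an assumption made for illustration (Parser uses PCRE, but the constructs shown behave the same way in both). All variable names are hypothetical.

```python
import re

text = "cat\ncart\ncaaat"

# ^ and $ refer to the whole subject by default...
whole_subject = re.findall(r"^ca+t$", text)
# ...and to each line in multiline mode.
per_line = re.findall(r"^ca+t$", text, re.MULTILINE)

# . matches any character except newline.
any_char = re.findall(r"c.t", "cat c\nt cut")

# ? makes the item on its left optional.
spellings = re.findall(r"colou?r", "color colour")

# (...) delimits subpatterns; | separates alternatives.
parts = re.search(r"(gr)(a|e)y", "grey").groups()

# {min,max} bounds the number of repetitions.
bounded = [s for s in ("a", "aa", "aaa", "aaaa")
           if re.fullmatch(r"a{2,3}", s)]
```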


The part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:

   \   general escape character   
   ^   negate the class, but only if it is the first character of the class definition; the class then matches any character not listed in it   
   -   indicates a character range   
   ]   terminates the character class   
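
A brief sketch of how character classes behave, again using Python's re module as a stand-in for PCRE (an assumption; the class syntax shown is common to both). The variable names are hypothetical.

```python
import re

# A class matches exactly one character out of those listed.
vowels = re.findall(r"[aeiou]", "regular")

# ^ as the first character negates the class.
non_digit_runs = re.findall(r"[^0-9]+", "ab12cd")

# - between two characters denotes a range.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aF") is not None

# ^ anywhere but the first position stands for itself.
literal_caret = re.findall(r"[a^]", "a^b")

# \ escapes a character that would otherwise end the class.
closing_bracket = re.findall(r"[\]]", "x]y")
```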



Backslash usage ("\")

The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes. For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphanumeric character with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".
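
The escaping rule can be checked with a small sketch. Python's re module is used here as an assumption for illustration; the rule itself is the same in PCRE. Raw strings (r"...") keep Python from interpreting the backslashes before the regex engine sees them.

```python
import re

# "\*" matches a literal asterisk instead of acting as a quantifier.
star = re.search(r"\*", "2*3").group()

# "\\" matches a single backslash; here it sits at offset 2 of C:\temp.
backslash_pos = re.search(r"\\", "C:\\temp").start()

# Unescaped, "." matches any character; escaped, only a literal dot.
loose = re.findall(r"1.5", "1.5 125")
strict = re.findall(r"1\.5", "1.5 125")
```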

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. It is usually easier to use one of the following escape sequences than the binary character it represents:

   \a   alarm, that is, the BEL character   
   \cx   "control-x", where x is any character   
   \e   escape, that is, the ASCII ESC character (hex 1B)   
   \f   formfeed   
   \n   newline   
   \r   carriage return   
   \t   tab   
   \xhh   character with hex code hh   
   \ddd   character with octal code ddd   
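
A minimal sketch of these escapes, using Python's re module as a stand-in for PCRE (an assumption). Note that Python's re does not accept \e or \cx, so the sketch writes the escape character by its hex code instead; the other sequences shown are accepted by both engines.

```python
import re

# \t matches a tab character.
fields = re.split(r"\t", "a\tb\tc")

# \xhh matches the character with hex code hh (41 is "A").
hex_match = re.search(r"\x41", "xAy").group()

# \ddd matches the character with octal code ddd (101 is "A").
octal_match = re.search(r"\101", "xAy").group()

# The ASCII ESC character (PCRE \e) written as \x1b.
esc_pos = re.search(r"\x1b", "\x1b[0m").start()
```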

The third use of backslash is for specifying generic character types:

   \d   any decimal digit [0-9]   
   \s   any white space character   
   \w   any "word" character   
   \D \S \W   any character that is NOT \d, \s, \w, respectively   
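
The generic character types can be sketched as follows. Python's re module is assumed for illustration; the re.ASCII flag is passed so that \d, \s and \w are limited to ASCII characters, which is closer to the default PCRE behaviour described here. The sample strings are made up.

```python
import re

# \d+ collects runs of decimal digits.
digits = re.findall(r"\d+", "tel 555 0199", re.ASCII)

# \w+ collects runs of "word" characters (letters, digits, underscore).
words = re.findall(r"\w+", "foo_bar baz-1", re.ASCII)

# \s+ matches any run of white space: spaces, tabs, newlines.
tokens = re.split(r"\s+", "a  b\tc\nd", flags=re.ASCII)

# \D is the complement of \d, so removing \D leaves only digits.
only_digits = re.sub(r"\D", "", "+1 (555) 01-99", flags=re.ASCII)
```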

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).

   \b   word boundary   
   \B   not a word boundary   
   \A   start of subject (independent of multiline mode)   
   \Z   end of subject or newline at end (independent of multiline mode)   
   \z   end of subject (independent of multiline mode)
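
The assertions above can be explored with a short sketch in Python's re module, which is an assumption made for illustration and differs from PCRE in one respect that matters here: Python's \Z matches only at the very end of the subject, which corresponds to PCRE's \z, while Python's "$" outside multiline mode behaves like PCRE's \Z. The comments mark where this applies.

```python
import re

# \b requires a word boundary on that side of the match.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# \B requires the absence of a word boundary.
inside_word = re.findall(r"\Bcat", "cat concat")

# \A matches only at the start of the subject, even in multiline mode.
subject_start = re.findall(r"\Aline", "line1\nline2", re.MULTILINE)
line_starts = re.findall(r"^line", "line1\nline2", re.MULTILINE)

# Python's \Z acts like PCRE's \z: the very end of the subject only.
strict_end = re.search(r"end\Z", "end\n")
# Python's $ acts like PCRE's \Z: the end, or before a final newline.
lenient_end = re.search(r"end$", "end\n")

# Inside a character class, \b is the backspace character.
backspace_pos = re.search(r"[\b]", "a\bb").start()
```

None of these assertions consume characters, so the matched text in each case is only the literal part of the pattern.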


Copyright © 1997–2021 Art. Lebedev Studio | http://www.artlebedev.com Last updated: 29.03.2011