Причудливый QRegExp

Материал из Wiki.crossplatform.ru

Перейти к: навигация, поиск
Image:qt-logo_new.png Image:qq-title-article.png
Qt Quarterly | Выпуск 1 | Документация

Содержание

[править] Причудливый QRegExp

by Jasmin Blanchette

C++ programmers can use Perl-style regular expressions with Qt 3.0's completely rewritten QRegExp class. We present some of the practicalities of using regular expressions for parsing, validation, and filtering. We also look at how to optimize regular expressions, and finish off with some weird trivia.

[править] Parsing

Yacc's own input files are best parsed with a more powerful parser than Yacc.
-- John Aycock

Parsing text files is a common problem in everyday programming. Writing full-blown parsers or using parser generators is often time consuming and difficult. Many problems can be solved with regular expressions that are fast and fun to write, and easy to adapt to changing specifications. But beware, regular expressions have some traps for the unwary, as we will see in our first example.

One common task is to find balanced pairs such as opening and closing parentheses or HTML tags. For example, this regular expression matches emphasized or italicized text in a HTML document:

 
 QRegExp rx( "<(em|i)>.*</\\1>" );
 rx.setCaseSensitive( FALSE );

The regular expression <(em|i)>.*</\\1> means "match <, then em or i, then >, then zero or more arbitrary characters, then </, then em or i (whichever matched earlier), and finally >." HTML actually allows for whitespace within tags, so a better regular expression is

 
 QRegExp rx( "<\\s*(em|i)\\s*>.*<\\s*/\\s*\\1\\s*>" );

By default, regular expressions are greedy, matching as much text as they can. For example, on the string "a red and blue flag", the above regular expression would match "red and blue" rather than just "red". The solution is to call

 rx.setMinimal( TRUE );
to instruct [documentation/qtdoc4.3/qregexp.php QRegExp] to match as little as possible, i.e. "red". An alternative that doesn't require setMinimal() and gives similar results is to replace .* with [^<]*, which will match anything except a < rather than anything at all. Unfortunately both solutions are fragile because they fail on nested pairs, for example:
 
 some <u><i><i>very</i></u> italicized</i> text

This is a general problem with regular expressions, and one of the reasons why full parsers are needed in some situations. On the other hand, if you can guarantee that the HTML is free from such constructs (possibly because the HTML is generated by your application), you can very simply parse HTML using regular expressions like the one above.


Applications can easily read and write their settings using QSettings, which internally uses configuration files or the Windows registry. But quite often we want to read and write configuration files in a custom file format. [documentation/qtdoc4.3/qregexp.php QRegExp] is useful for these situations. Many configuration files, including qmake's .pro files, contain entries of the form "key = value," for example:

 # calc.pro
 SOURCES = appwindow.cpp \
       main.cpp
 TARGET  = calc
 

Parsing these files is non-trivial, because the format allows arbitrary amounts of whitespace, and can have entries that span multiple lines (using backslash). We will write a simple .pro parser using [documentation/qtdoc4.3/qregexp.php QRegExp].

 
 QFile in( "calc.pro" );
 if ( !in.open(IO_ReadOnly) )
      return;
 QString text = in.readAll();
 in.close();

First we read in the file, here calc.pro, into a QString.

 
 text.replace( QRegExp("#[^\n]*"), "" );
 text.replace( QRegExp("\\\\\\s*\n"), " " );
 text.replace( QRegExp("\\s*\n\\s*"), "<END>" );
 text.replace( QRegExp("="), " = " );
 text = text.simplifyWhiteSpace();

Then we convert the file into a canonical form. Comments are stripped, and multi-line entries are converted into single lines, with each escaped line terminator replaced with a single space. We then ensure that there is a single space between each token. Every entry has an <END> marker added at the end. (We cannot simply keep the original \n's because QString::simplifyWhiteSpace() replaces them with spaces.)

 
 QStringList lines = QStringList::split( "<END>", text );
 QStringList::Iterator line = lines.begin();
 while ( line != lines.end() ) {
 	QStringList toks = QStringList::split( " ", *line );
 	QStringList::Iterator it = toks.begin();
 
 	if ( toks.count() >= 3 && toks[1] == "=" ) {
 	    QString key = *it++;
 	    it++;
 	    while ( it != toks.end() ) {
 		process_value( key, *it++ );
 	    }
 	}
 	++line;
 }

We split the whole text into lines and then split each line into tokens. We take the first token as the key, ignore the '=' token, and use the remaining tokens as values. And that's all: a simple .pro file parser in about 20 lines of code.


Our third application of [documentation/qtdoc4.3/qregexp.php QRegExp] is a program that converts // comments into C-style /*...*/ comments. Such a program would be useful for porting C++ source code to C.

It is tempting to write this:

 
 text.replace( QRegExp("//([^\n]*)(\n|$)"), "/*\\1 */\\2" );

Unfortunately this doesn't work. First, Qt 3.0 does not support the back-reference syntax (\\1, \\2, ...) in replacement text; this will be introduced in Qt 3.1. Second, the regular expression will wrongly replace occurrences of // in /*...*/ comments and in string literals.

A more sophisticated approach uses the following regular expression:

 
 QRegExp rx( "(//([^\n]*)(\n|$)|/\\*.*\\*/|\"([^\\\\]|\\\\.)*\"" );
 rx.setMinimal( TRUE );

The regular expression consists of three alternate parts: //([^\n]*)(\n|$) matches a C++-style comment, /\\*.*\\*/ matches a C-style comment, and \"([^\\\\]|\\\\.)*\" matches a literal string. Minimal matching is used for the same reason as in the HTML example presented earlier; without it, we might match the /* of one comment and the */ of another.

The regular expression contains three pairs of parentheses. The text matched by these is available as QRegExp::cap(1), cap(2), and cap(3). In this case, we will only make use of cap(1), and could use non-capturing parentheses (?:...) instead of (...) in the other two cases.

We can use the regular expression in a loop as follows:

 
 
 int pos = 0;
 while ( (pos = rx.search(text, pos)) != -1 ) {
 if ( rx.cap(1).startsWith("//") ) {
     QString before = rx.cap( 1 );
     QString after = "/*" + before.mid( 2 ) + " */";
     text.replace( pos, before.length(), after );
     pos += after.length();
 } else {
     pos += rx.matchedLength();
 }
 }

This will work in all cases.*

* This will fail if the // comment contains */. For such rare cases, hand correction is the most practical solution.

[править] Validation

I never metacharacter I didn't like.
-- Dale Dougherty

Regular expressions can be used to validate user input as it is typed. We present an example that uses [documentation/qtdoc4.3/qregexpvalidator.php QRegExpValidator], a class introduced in Qt 3.0, to validate an International Standard Book Number (ISBN) as it is typed into a QLineEdit.

An ISBN is a unique 10-digit number that identifies a book. The tenth digit is a checksum in the range 0 to 10, with 10 being represented by the letter 'X'. Hyphens are used to separate the number into components according to a complex algorithm to ease reading.

A simple solution that doesn't verify the checksum and doesn't validate hyphens looks like this:

 
 
 QRegExp rx( "([0-9]-?){9}[0-9X]" );
 QRegExpValidator *val = new QRegExpValidator( rx, lineEdit );
 lineEdit->setValidator( val );

Regular expressions used in validators are automatically anchored at both ends, so the entire string must match.

Validators distinguish three kinds of input text: invalid, intermediate, and acceptable. Here's an example of each kind:

Invalid: A
Intermediate: 0-340-6
Acceptable: 0-340-67998-0

If we want to validate the checksum, we can subclass [documentation/qtdoc4.3/qregexpvalidator.php QRegExpValidator] like this:

 
 
 class IsbnValidator : public QRegExpValidator
 
 {
 public:
 IsbnValidator( QObject *parent, const char *name )
     : QRegExpValidator( parent, name )
 {
     setRegExp( QRegExp("([0-9]-?){9}[0-9X]") );
 }
 
 virtual State validate( QString& str, int& index ) const
 {
     State state = QRegExpValidator::validate( str, index );
     if ( state == Acceptable ) {
 	int sum = 0;
 	int multiplier = 10;
 
 	for ( int i = 0; i < (int) str.length(); i++ ) {
 	    if ( str[i].isDigit() )
 		sum += str[i].digitValue() * multiplier--;
 	    else if ( str[i] == 'X' )
 		sum += 10;
 	}
 	if ( sum % 11 != 0 )
 	    state = Intermediate;
     }
     return state;
 }
 };

A more sophisticated IsbnValidator class would reimplement QValidator::fixup() to insert hyphens automatically at the correct positions as the user types (a non-trivial task), and would convert 'x' into 'X'.

[править] Searching and Filtering

I define Unix as "30 definitions of regular expressions living under one roof."
-- Donald E. Knuth

Regular expressions are not necessarily hard-coded by programmers; they are sometimes typed in by users. This is especially common in the Unix world, where applications such as emacs, grep, lex, and vi build much of their core functionality around regular expressions. Windows users don't miss out entirely; Microsoft Word's "Find and Replace" dialog has an option for using regular expressions with its own syntax, and Borland tools traditionally support Unix-style regular expressions.

[documentation/qtdoc4.3/qregexp.php QRegExp] natively supports two syntaxes: a powerful regular expression syntax similar to Perl's syntax, and a simple wildcard syntax reminiscent of MS-DOS and common Unix shells. It can be useful to let the user choose whether to match literal text or to use wildcards or full regular expressions in an application's "Find" dialog or when filtering data or file names. A literal string can be converted into a [documentation/qtdoc4.3/qregexp.php QRegExp] by escaping each special character with a backslash. The following literal() function does that.

 
 
 QString literal( QString str )
 {
 QString meta = "$()*+.?[\\]^{|}";
 int i = 0;
 
 while ( i < (int) str.length() ) {
     if ( meta.find(str[i]) != -1 )
 	str.insert( i++, "\\" );
     i++;
 }
 return str;
 }

Trolltech will introduce QRegExp::literal() in Qt 3.1 to provide this functionality.

Specialized applications might need to support other syntaxes. For example, a database application might offer SQL wildcard syntax, where % stands for zero or more arbitrary characters and _ stands for one character.

 
 
 QString sql = "water%";
 sql = literal( sql );
 sql.replace( QRegExp("%"), ".*" );
 sql.replace( QRegExp("_"), "." );
 QRegExp rx( sql );

Applications can use QRegExp::isValid() to warn users when they have entered an invalid regular expression. Qt 3.1 will provide access to textual error messages (e.g., "Unexpected right parenthesis at position 2").

On request, Trolltech may grant customers permission to use the [documentation/qtdoc4.3/qregexp.php QRegExp] documentation for their own end-user documentation.

[править] QRegExp Optimizations

There are two kinds of parser: fast parsers, and those that use regular expressions.
-- Rainer M. Schmid

We will now peek at how [documentation/qtdoc4.3/qregexp.php QRegExp] works behind the scenes, and review some techniques for using [documentation/qtdoc4.3/qregexp.php QRegExp] efficiently on large data sets. Be aware that the implementation details presented here might change in a future version of Qt, and that most programs don't need these optimizations.

The code

 
 
 QRegExp rx( "de+f" );

converts the string "de+f" into a complex data structure designed for fast matching. The process of converting the text of the regular expression into an internal data structure is termed "compiling a regular expression" and occurs at run-time. [documentation/qtdoc4.3/qregexp.php QRegExp] uses two optimizations to amortize the cost of constructing a [documentation/qtdoc4.3/qregexp.php QRegExp] object.

  1. Implicit sharing.
    The data structure underlying [documentation/qtdoc4.3/qregexp.php QRegExp] is shared by multiple objects that have the same regular expression pattern. This is a very common optimization in Qt; other shared classes include QBrush, QCursor, QFont, QIconSet, QMap, QPalette, QPicture, QPixmap, QRegion, and QString.

  2. Caching.
    Qt maintains a list of the most recently constructed [documentation/qtdoc4.3/qregexp.php QRegExp] objects to avoid the cost of recompiling the same regular expression. Caching makes the following code relatively efficient:
 for ( int i = 0; i < 1000; i++ ) {
 QRegExp rx( "de+f" );
 }

It is nevertheless good practice to construct [documentation/qtdoc4.3/qregexp.php QRegExp] objects outside loops when practicable.

Certain regular expression constructs slow down compiling or matching, while others can speed up matching. Here are some tips:

  • Use (?:...) instead of (...) whenever possible.
    Capturing parentheses require [documentation/qtdoc4.3/qregexp.php QRegExp] to perform additional book-keeping.

  • Use ^ or QRegExp::exactMatch() whenever possible.
    The following example gives the same results with or without the ^, but is much faster if ^ is used:
if ( line.find(QRegExp("^(SRC|SOURCE)=")) == 0 ) {
...
}
  • Factor out common sub-expressions.
    A regular expression such as th(?:is|at) is faster than this|that.

  • Avoid back-references (\\1, \\2, ...), repetitions ({1,500}), and lookaheads ((?=...) and (?!...)).
    [documentation/qtdoc4.3/qregexp.php QRegExp] is not optimized for these exotic constructs.

  • Avoid nondeterminism.
    Nondeterminism occurs when a regular expression matches a string in many ways. For example, the regular expression (a|a) matches the string "a" in two ways, and (a|a)+ matches "aaaa" in 16 different ways. More pathological examples are (a+){64} and (a+)+. [documentation/qtdoc4.3/qregexp.php QRegExp] tends to be less sensitive to nondeterminism than Perl, but more sensitive than AWK.

Some optimizations are specific to QRegExp::search(), a function that searches a potentially large text for a match.


rx.search( "abcdefghijklmnopqrstuvwxyz" );

Conceptually, search() attempts to match the regular expression de+f at positions 0, 1, 2, ..., of the string "abcdefg..." as illustrated below.

abcdefg... abcdefg... abcdefg... abcdefg...
^0 ^1 ^2 ^3
de+f de+f de+f de+f

Four attempts are made before a match is found ("def" at position 3). Attempting a match at every position in the target string is expensive, and can be avoided using heuristics. An analysis of the regular expression de+f provides the following information:

  • The minimum length of any matched text is 3 (e.g., "def").
  • Any match must start with the fixed text "de".
  • The letter "d" occurs earliest at position 0 in a match, "e" at position 1, and "f" at position 2. No other character may appear in the match.

QRegExp::search() uses this information and automatically selects one of the following heuristics to locate potential matches: the good-substring heuristic or the bad-character heuristic.

  1. Good-substring heuristic.
    A "good substring" is a fixed string that must appear in any match at a more or less known position. Here are some examples of regular expressions and good substrings: {| border="1" cellpadding="1"

| Regular expression | Good substring | Position in match |- | de+f | de |- | while | while |- | a{0,3}while | while | 0, 1, 2 or 3 |- | (worth)?while | while | 0 or 5 |- | for|while | none |}If the good-substring heuristic is selected, QRegExp::search() only runs the full-blown matching engine when the potential match contains the fixed string in a correct position. For example, the full matcher is only run twice (instead of 26 times) on the following string:

headed deer ddd eee fff

QRegExp::search() locates good substrings using the QString::find(const QString& str, int index) function.

  1. Bad-character heuristic.
    "Bad characters" are characters that cannot occur in a match. If this heuristic is selected, QRegExp::search() only runs the full-blown matching engine when the potential match contains no bad characters. For example, de+f cannot possibly match "dxf" since "x" cannot appear in any match, so there is no use running the matching engine on it. The bad-character heuristic not only distinguishes between good and bad characters, but also recognizes that some good characters are better than others. In de+f, "d" may occur in position 0, where neither "e" nor "f" may occur.The full matcher is only run three times on the following string:
a "hea"ded deer ddd eee fff

To benefit from the bad-character heuristic, it is best to avoid large character classes (e.g., [^\n]) as well as . (dot), \\d, \\s, \\w, \\D, \\S, and \\W. The heuristic works best when most characters are "bad," and only a selected few are "good."


Case Insensitivity and Performance

Many regular expression interpreters, including the Perl 5.0 interpreter, implement case-insensitive search by converting the regular expression and the whole searched text to lower-case beforehand. [documentation/qtdoc4.3/qregexp.php QRegExp] doesn't suffer from this gratuitous inefficiency; however, be aware that the bad-character heuristic is only available for case-sensitive matches. Case-insensitive regular expressions that do not contain a good substring can be replaced by a case-sensitive regular expression, if they are heavily used. For example, change true|false to [Tt][Rr][Uu][Ee]|[Ff][Aa][Ll][Ss][Ee] -- or to true|TRUE|false|FALSE if that's what you mean.



The internal function heuristicallyChooseHeuristic() selects the appropriate heuristic based on the regular expression. It calculates scores for the good-substring heuristic and for the bad-character heuristic according to a mathematical formula. The following table shows the scores for different regular expressions, with the good substring shown underlined.


Regular expression Good substring heuristic Bad character heuristic
while 64 31
\\bwhile\\b 64 31
(worth)?while 59 31
[Ww][Hh][Ii][Ll][Ee] 30
while 31
[A-Z_a-z][A-Z_a-z0-9]* 18
while)


Curious or enthusiastic readers can compile their own statistics by inserting the following statement near the end of heuristicallyChooseHeuristic() in qregexp.cpp:

 
 qDebug( "gsh %d, bch %d, '%s' at %d..%d",
     goodStringScore, badCharScore,
     goodString.isNull() ? "" : goodString.latin1(),
     goodEarlyStart, goodLateStart );

The internal function dump() may also be useful for such investigations.

The good-substring and bad-character heuristics are used in QRegExp::search() and QString::find(), but not in QRegExp::searchRev() or QString::findRev(), so reverse searches can be significantly slower.

The most effective way to optimize a regular expression is not to use one at all. Since regular expressions are "compiled" at run-time, they can never be as fast as hand-optimized code that doesn't use them. For example, on a large string, the following statement performs poorly:

 
 text.replace( QRegExp("\r\n"), "\n" );

This behavior is partly due to [documentation/qtdoc4.3/qregexp.php QRegExp], and partly due to the naive implementation of QString::replace(). The following code can be used when performance is an issue:

 
 int j = 0;
 for ( int i = 0; i < (int) text.length(); i++ ) {
 if ( text[i] != '\r' || text[i + 1] != '\n' )
     text[j++] = text[i];
 }
 text.truncate( j );

Trolltech will overload QString::replace() to accept a plain string as its first argument in Qt 4.0, diminishing the need for such code. (A subtle source-compatibility issue prevents its inclusion in Qt 3.x.)

[править] Mathematical Trivia

Every program has (at least) two purposes: the one for which it was written, and another for which it wasn't.
-- Alan J. Perlis

Although most of us only consider regular expressions in the context of string manipulation, regular expressions are also capable of number-theoretical stunts best implemented using traditional programming techniques.

We are going to work with positive integers, so first we must find a convenient way to represent them. Computer scientists are familiar with binary, octal, decimal and hexadecimal; logicians and pre-school children with unary notation. We will find unary notation the most suitable for our twisted purposes. For example, the unary notations for 1, 2, 3, 4, and 5 are I, II, III, IIII, and IIIII.

In unary, the regular expression (II)+ matches the even integers, and I(II)* the odd integers. For example,

 
 QRegExp("I(II)*").exactMatch("IIII")

returns FALSE, since we used the odd regular expression and 4 is even.

The regular expression (I\\1*) matches powers of two: 1, 2, 4, 8, 16, 32, ..., 2n, etc. It relies on a [documentation/qtdoc4.3/qregexp.php QRegExp] implementation detail: \\1 within the first capturing parentheses expands to the text captured so far. For example, once I is matched, a second I can be matched by \\1, then two more to bring the total to four, then four more, and so on. Similarly, (\\1I)+ matches triangular numbers: 1, 3, 6, 10, ..., n(n + 1)/2, etc.

A popular regular expression from the Perl world is one that matches composite numbers: (II+)\\1+. Its companion (?!(II+)\\1+$)I* matches prime numbers. These regular expressions are highly nondeterministic in the sense explained above. [documentation/qtdoc4.3/qregexp.php QRegExp]'s approach to nondeterminism, while making the common case faster, prevents these two regular expressions from fulfilling their mission. This is solely an issue for regular expressions that use back-references in perverse, undocumented ways.

On the other hand, [documentation/qtdoc4.3/qregexp.php QRegExp] is perhaps unique in that it supports Fibonacci numbers. Fibonacci numbers are members of the sequence 1, 1, 2, 3, 5, 8, 13, 21, 34, ..., where each number is the sum of its two predecessors (e.g., 8 + 13 = 21, 13 + 21 = 34). They are found mostly in nature and in computer science textbooks. The following two regular expressions match Fibonacci numbers:


(?:((?:^I)?\\2\\3)(\\3\\1|(?=I$))(\\1\\2|(?=I$)))*I


(I(?:(\\1\\3)(\\1\\2))*(?:\\1\\3)?)|
((?:(\\4(?:^I)?\\6)(\\4\\5))*(?:\\4\\6)?)

Both contain back-references that precede the corresponding capturing parentheses, a sanity-challenging feature that no other popular regular expression interpreter supports.

[править] Bugs?

Qt 3.0's [documentation/qtdoc4.3/qregexp.php QRegExp] implementation has been used within Trolltech since August 2000. Very few bugs have been discovered, and all of them were fixed prior to the Qt 3.0 release. In fact, the most common report regarding [documentation/qtdoc4.3/qregexp.php QRegExp] is that

 
 text.replace( QRegExp("\\"), "/" );

doesn't convert backslashes into forward slashes. Nor should it; the correct code is

 
 text.replace( QRegExp("\\\\"), "/" );

[править] References

The theory behind regular expressions is covered in Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman. The O'Reilly book Mastering Regular Expressions by Jeffrey Friedl is a pleasant read for anybody interested in regular expressions. Be aware, though, that most of the optimizations presented in Friedl's book only apply to the Perl 5.0 engine. Finally, the good-string and bad-character heuristics were inspired from the Boyer-Moore algorithm, described in section 34 of Introduction to Algorithms by Cormen, Leiserson, and Rivest.

Copyright © 2002 Trolltech.

Trademarks