Hi, From: Mark Wielaard <mark@xxxxxxxxx> Date: Fri, 27 Jan 2006 13:39:23 +0100 > > The topic here is regular expressions, where I am seeing a lot of tests > > fail. Is there any information available as to the precise specification of > > the regular expressions supported by Gnu Classpath, and/or the differences > > from the Sun JRE? > Ito has been working hard to get the syntax compatible with the one > expected from util.regex. Most of his fixes are very recent though and > have been added after the 0.20 snapshot. The reason I started working on gnu.regexp was XML related, just as Michael found problems in GNU Classpath's regex handling for XML processing. It could not even accept such simple regex's as /(a*)*/ which is needed to verify a XML document as "<a/>". I have just submitted another proposed patch to the classpath-patches mailing list, which is supposed to fix Bug#25812 in GCC Bugzilla Bug. And that is, as far as I can see, the last regex related bug reported in GCC Bugzilla Bug. But that is not the end of the story. gnu.regexp has more bugs to be fixed. Mauve testcases gnu/testlet/java/util/regex/Pattern/testdata2 lists regex's that gnu.regexp cannot process. The problem is I am not a regex expert at all. I am not familiar with such important technical terms as NFA and DFA found in http://www.oreilly.com/catalog/regex2/chapter/ch08.pdf In order to proceed further, deeper knowledge about regex would be needed. Let's take an example in the Mauve testcase. /^(b+?|a){1,2}?c/ bbac 0: bbac 1: a This means the string "bbbc" should match the regex /^(b+?|a){1,2}?c/ and the captured group #1 should be "a". But gnu.regexp works this way. ^ matches the beginning of the string. b+? matches "b" and the reluctant operator "?" suppresses further matching. (b+?|a) matches "b". (b+?|a){1,2}? mathes "b" and the reluctant operator "?" suppresses further matching. (b+?|a){1,2}?c does not match "bb" and the matching fails. This bug can be fixed by setting stopMatchingIfSatisfied in RETokenRepeated to false. If it is not desired, we can make the code dynamically switch stopMatchingIfSatisfied (when looking for all possibilities, it is set to false, and when finding a match reluctanly, it is set to true). This approach works quite well. But it makes a regression test fail. Let's take a example. /(([a-c])b*?\2)*/ ababbbcbc 0: ababb 1: bb 2: b With stopMatchingIfSatisfied being set to false, gnu.regexp works this way. 0: ([a-c])b*?\2) matches "ababbbcbc" zero times. 1: ([a-c])b*?\2) matches "aba", and the remaining string is "bbbcbc". Let's look for ([a-c])b*?\2) in "bbbcbc". 2: ([a-c]) matches "b". 3: b* matches "", "b" and "bb" ("" comes first because b* is followed by ?). 3.1: In case b* matches "", 3.1.1: ([a-c])b*?\2) matches "bb", and the remaining text is "bcbc". Let's look for ([a-c])b*?\2) in "bcbc". 3.1.2: ([a-c])b*?\2) does not match "bcbc". 3.2: In case b* matches "b", 3.2.1: ([a-c])b*?\2) matches "bbb", and the remaining text is "cbc". 3.2.2: ([a-c])b*?\2) matches "cbc". 3.3: In case b* matches "bb", the remaining string does not match ([a-c])b*?\2). Matches found for (([a-c])b*?\2)* are, in the order of the number of repeats, ababbb(cbc) --- 3.2.2 aba(bb) --- 3.1.1 aba(bbb) --- 3.2.1 (aba) --- 1 () --- 0 () encloses the last match for ([a-c])b*?\2). And the longest match ababbb(cbc) is treated as the final result. But other implementations (Sun's JDK, Perl, Ruby, etc.) seem to stop at 3.1.2 and do not go further to 3.2. So the final result is aba(bb). In short, gnu.regexp performs a width-first search and other implementations seem to perform a depth-first search. "Width" here means possible matches and "depth" means the number of repeats. Does a depth-first search resolve this proble? I tried and faced another problem. With a depth-first search, some optimization now done in RETokenRepeated cannot easily be done. It is commented in RETokenRepeated.java as // (2) when the same string matches this token many times. // For example, "acbab" itself matches "a.*b" and // its substrings "acb" and "ab" also match. // In this case, we do not have to go further until // numRepeats == max because the more numRepeats grows, // the shorter the substring matching this token becomes. // So the previous succesful match must have bee the best // match. But this is not necessarily the case if stingy. Without this optimization, finding /(a+)*b/ from "aaaaaaaaaaaaaaaaa" takes a killing long time before the matching fails. Perl and Ruby are very fast in finding "aaaaaaaaaaaaaaaaa" =~ /(a+)*b/ is false. Sun's JDK is rather slow in this case, but not so killingly slow. They must be doing some optimization for this situation. I think I will continue to improve gnu.regex But I have been too addicted to regex these days, so I will slow down.