Regex compatibility

kaz at maczuka.gcd.org (Ito Kazumitsu) · Sun Feb 5 01:50:25 2006

Hi,

From: Mark Wielaard <mark@xxxxxxxxx>
Date: Fri, 27 Jan 2006 13:39:23 +0100

> > The topic here is regular expressions, where I am seeing a lot of tests
> > fail. Is there any information available as to the precise specification of
> > the regular expressions supported by Gnu Classpath, and/or the differences
> > from the Sun JRE?

> Ito has been working hard to get the syntax compatible with the one
> expected from util.regex. Most of his fixes are very recent though and
> have been added after the 0.20 snapshot.

The reason I started working on gnu.regexp was XML related, just as
Michael found problems in GNU Classpath's regex handling for XML
processing. It could not even accept such simple regex's as /(a*)*/
which is needed to verify a XML document as "<a/>".

I have just submitted another proposed patch to the classpath-patches
mailing list, which is supposed to fix Bug#25812 in GCC Bugzilla Bug.
And that is, as far as I can see, the last regex related bug reported
in GCC Bugzilla Bug.

But that is not the end of the story.  gnu.regexp has more bugs to be
fixed. Mauve testcases gnu/testlet/java/util/regex/Pattern/testdata2
lists regex's that gnu.regexp cannot process.

The problem is I am not a regex expert at all. I am not familiar
with such important technical terms as NFA and DFA found in
http://www.oreilly.com/catalog/regex2/chapter/ch08.pdf

In order to proceed further, deeper knowledge about regex would be needed.
Let's take an example in the Mauve testcase.

/^(b+?|a){1,2}?c/
    bbac
 0: bbac
 1: a

This means the string "bbbc" should match the regex /^(b+?|a){1,2}?c/
and the captured group #1 should be "a".

But gnu.regexp works this way.

  ^ matches the beginning of the string.
  b+?  matches "b" and the reluctant operator "?" suppresses further matching.
  (b+?|a) matches "b".
  (b+?|a){1,2}? mathes "b" and the reluctant operator "?" suppresses
                further matching.
  (b+?|a){1,2}?c does not match "bb" and the matching fails.

This bug can be fixed by setting stopMatchingIfSatisfied in
RETokenRepeated to false. If it is not desired, we can make
the code dynamically switch stopMatchingIfSatisfied (when
looking for all possibilities, it is set to false, and
when finding a match reluctanly, it is set to true).

This approach works quite well. But it makes a regression test fail.
Let's take a example.

/(([a-c])b*?\2)*/
    ababbbcbc
 0: ababb
 1: bb
 2: b

With stopMatchingIfSatisfied being set to false, gnu.regexp works this way.

  0: ([a-c])b*?\2) matches "ababbbcbc" zero times. 

  1: ([a-c])b*?\2) matches "aba", and the remaining string is "bbbcbc".
     Let's look for ([a-c])b*?\2) in "bbbcbc".

  2: ([a-c]) matches "b".
  3: b* matches "", "b" and "bb"
     ("" comes first because b* is followed by ?).

  3.1: In case b* matches "",
  3.1.1: ([a-c])b*?\2) matches "bb", and the remaining text is "bcbc".
         Let's look for ([a-c])b*?\2) in "bcbc".
  3.1.2: ([a-c])b*?\2) does not match "bcbc".

  3.2: In case b* matches "b",
  3.2.1: ([a-c])b*?\2) matches "bbb", and the remaining text is "cbc".
  3.2.2: ([a-c])b*?\2) matches "cbc".

  3.3: In case b* matches "bb", the remaining string
       does not match ([a-c])b*?\2).

  Matches found for (([a-c])b*?\2)* are, in the order of the
  number of repeats,

       ababbb(cbc)   --- 3.2.2
       aba(bb)       --- 3.1.1
       aba(bbb)      --- 3.2.1
       (aba)         --- 1
       ()            --- 0

       () encloses the last match for ([a-c])b*?\2).

  And the longest match ababbb(cbc) is treated as the final result.

  But other implementations (Sun's JDK, Perl, Ruby, etc.) seem to
  stop at 3.1.2 and do not go further to 3.2. So the final result
  is aba(bb).

In short, gnu.regexp performs a width-first search and other
implementations seem to perform a depth-first search. "Width"
here means possible matches and "depth" means the number of repeats.

Does a depth-first search resolve this proble?

I tried and faced another problem.

With a depth-first search, some optimization now done in RETokenRepeated
cannot easily be done. It is commented in RETokenRepeated.java as

            //   (2) when the same string matches this token many times.
            //       For example, "acbab" itself matches "a.*b" and
            //       its substrings "acb" and "ab" also match.
            //       In this case, we do not have to go further until
            //       numRepeats == max because the more numRepeats grows,
            //       the shorter the substring matching this token becomes.
            //       So the previous succesful match must have bee the best
            //       match.  But this is not necessarily the case if stingy.

Without this optimization, finding /(a+)*b/ from "aaaaaaaaaaaaaaaaa"
takes a killing long time before the matching fails.

Perl and Ruby are very fast in finding "aaaaaaaaaaaaaaaaa" =~ /(a+)*b/
is false.  Sun's JDK is rather slow in this case, but not so killingly
slow.  They must be doing some optimization for this situation.

I think I will continue to improve gnu.regex But I have been too
addicted to regex these days, so I will slow down.