Re: git diff looping?

Jeff King <peff@xxxxxxxx> · Wed, 17 Jun 2009 06:23:33 -0400

On Wed, Jun 17, 2009 at 10:46:21AM +0200, Paolo Bonzini wrote:

> 2) make sure that at least one space/tab is eaten on all but the last  
> occurrence of the repeated subexpression.  To this end the LHS of {2,} is 
> duplicated, once with [ \t]+ and once with [ \t]*.  The repetition itself 
> becomes a + since the last occurrence is now separately handled:
>
> ^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\([^;]*)$

Thanks, I can confirm that this is _much_ faster. Here are some timings
from my Solaris 8 box for the "git diff v0.4.0" case using the system
and compat engines, and using three regexes: the original that git is
using now, an updated one with your regex above[1] replacing the second
line of the stock pattern, and a baseline regex of "." which should take
virtually no time at all.

  system,  orig: infinite
  system, paolo:   2.5s
  system,   ".":   0.6s
  compat,  orig: 288.0s
  compat, paolo:   1.5s
  compat,   ".":   0.6s

So the system library goes from infinite to 2.5s, which still means
spending about 3 times as long matching funcname regexes as actually
calculating the diff. The compat library is a little better, but still
chokes pretty badly on the original regex.
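
To get a feel for where the blow-up comes from, here is a rough sketch
in Python. It is not what git runs: Python's re module is a backtracking
engine standing in for the POSIX libraries timed above, and ORIG is only
an assumed reconstruction of the shape Paolo describes (optional
whitespace inside the {2,} group), not a copy of git's pattern. The
names IDENT, ORIG, PAOLO, matches and time_match are made up for the
example.

```python
import re
import time

IDENT = r"[A-Za-z_][A-Za-z_0-9]*"

# Assumption: shape of the old pattern, reconstructed from the quoted
# description. Whitespace inside the repeated group is optional, so a run
# of identifier characters can be split into "identifiers" in many ways.
ORIG = re.compile(r"^[ \t]*(([ \t]*" + IDENT + r"){2,}[ \t]*\([^;]*)$")

# Paolo's rewrite as quoted: every repetition but the last must eat at
# least one space or tab, so there is only one way to split the line.
PAOLO = re.compile(r"^[ \t]*((" + IDENT + r"[ \t]+)+" + IDENT + r"[ \t]*\([^;]*)$")


def matches(pattern, line):
    """Return True if the funcname pattern matches the line."""
    return pattern.search(line) is not None


def time_match(pattern, line):
    """Return the seconds spent on one match attempt."""
    start = time.perf_counter()
    pattern.search(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    decl = "    public static void main(String[] args) {"
    print("declaration:", matches(ORIG, decl), matches(PAOLO, decl))

    # A line with no "(" forces a backtracking engine to try every split
    # of the identifier run before giving up; keep the run short.
    for n in (10, 14, 18):
        line = "a" * n
        print(n, "orig %.4fs" % time_match(ORIG, line),
              "paolo %.4fs" % time_match(PAOLO, line))
```

On a backtracking engine the time for ORIG should grow quickly with the
length of the identifier run, while PAOLO stays flat; the absolute
numbers say nothing about the Solaris or glibc libraries.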

Let's compare compat to the glibc implementation on my Debian box:

  system,  orig:   0.22s
  system, paolo:   0.22s
  system,   ".":   0.15s
  compat,  orig: 150.88s
  compat, paolo:   0.43s
  compat,   ".":   0.15s

Besides the exponential behavior on the original regex, compat is still
about twice as slow as the system library on Paolo's regex (0.43s vs.
0.22s).
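
The ratios quoted in this message can be rechecked from the two tables
with a few lines of Python. This is only arithmetic on the numbers
above; it assumes the "." run approximates the cost of the diff itself,
and the names TIMINGS, regex_cost and total_ratio are made up for the
example.

```python
# Wall-clock seconds copied from the two tables above.
TIMINGS = {
    "solaris": {"system": {"paolo": 2.5, "dot": 0.6},
                "compat": {"paolo": 1.5, "dot": 0.6}},
    "debian": {"system": {"paolo": 0.22, "dot": 0.15},
               "compat": {"paolo": 0.43, "dot": 0.15}},
}


def regex_cost(box, engine):
    """Seconds attributed to funcname matching: total minus the "." run."""
    t = TIMINGS[box][engine]
    return t["paolo"] - t["dot"]


def total_ratio(box, slow, fast):
    """Ratio of total run times for Paolo's regex on one box."""
    return TIMINGS[box][slow]["paolo"] / TIMINGS[box][fast]["paolo"]


if __name__ == "__main__":
    diff_cost = TIMINGS["solaris"]["system"]["dot"]
    print("solaris system, regex cost / diff cost: %.2f"
          % (regex_cost("solaris", "system") / diff_cost))
    print("solaris, system / compat regex cost: %.2f"
          % (regex_cost("solaris", "system") / regex_cost("solaris", "compat")))
    print("debian, compat / system total time: %.2f"
          % total_ratio("debian", "compat", "system"))
```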

So I think there are three possible optimizations worth considering:

  1. Replace the builtin diff.java.xfuncname pattern with what Paolo
     suggested (though I haven't verified its correctness beyond a
     cursory look at the results). This is easy to do, and will help
     people with crappy system regex libraries and people on
     compat/regex/ (right now just mingw) a _lot_. The downside is that
     it's a little harder to read the regex, but not terribly so.

  2. Recommend NO_REGEX for people with slow system regex libraries.
     This is also easy to do, and will help people even if we do (1) for
     two reasons:

       a. we process user-defined regexes through diff.*.xfuncname
          patterns, as well as through "git grep"; so we are protecting
          against poor performance when they give us a complex regex

       b. even on more reasonable regexps like Paolo's, we seem to get a
          2:1 speedup over the Solaris system library

  3. Replace compat/regex with something faster. It still produces
     exponential behavior in complex cases where glibc does not, and it
     seems to be about 1/3 as fast on Paolo's regex.

     I haven't looked at how large or how portable the glibc
     implementation is. Another alternative is that we could provide a
     simple compat/ as now, and have better support for linking against
     an external library like pcre, if it is available.

-Peff

[1] Note if you are cutting and pasting Paolo's regex into the C code,
    the "\(" needs to be "\\(", which I screwed up in my initial
    timings. :)
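
    The escaping point can be illustrated in Python, whose ordinary
    (non-raw) string literals treat "\t" and "\\" the same way C string
    literals do. This is a sketch, not git's code: it uses Python's re
    module rather than a POSIX regcomp(), and it assumes the C compiler
    reduces the unknown escape "\(" to a plain "(", which is typical
    but not guaranteed. The names IDENT, escaped, unescaped and compiles
    are made up for the example.

```python
import re

IDENT = "[A-Za-z_][A-Za-z_0-9]*"

# What a source literal written with "\\(" hands to the regex compiler:
# a backslash followed by "(", i.e. a literal parenthesis.
escaped = "^[ \t]*((" + IDENT + "[ \t]+)+" + IDENT + "[ \t]*\\([^;]*)$"

# Assumption: the same pattern after the backslash has been lost, so the
# "(" now opens a group instead of matching a literal parenthesis.
unescaped = escaped.replace("\\(", "(")


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print("escaped compiles:  ", compiles(escaped))
    print("unescaped compiles:", compiles(unescaped))
```

    Whatever a particular regex library does with the second form, it is
    no longer the pattern Paolo wrote.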