On Sat, 1 Dec 2007, Christian Thaeter wrote:
> Short story: 'memmem()' is a gnuish, nonstandard function. I wanted to
> provide a generic fallback in some code, so let's borrow it from
> somewhere else.
>
> First I looked at git's compat/memmem.c, which I found suboptimal,
> with roughly O(M*N) performance (albeit OK for that case, since it is
> just a generic fallback).
>
> Second, taking a look at glibc's source surprised me: there is the
> same code as in git. I somewhat expected a faster implementation from
> a general-purpose C library.
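
For the record, that shared fallback is essentially the naive quadratic
scan. A from-memory sketch of its shape, not the verbatim git/glibc
code:

#include <stddef.h>
#include <string.h>

/* Naive O(M*N) search: try a full memcmp() at every haystack offset. */
void *naive_memmem(const void *haystack, size_t hl,
                   const void *needle, size_t nl)
{
        const char *h = haystack;
        size_t i;

        if (nl == 0)
                return (void *)h;
        if (hl < nl)
                return NULL;
        for (i = 0; i <= hl - nl; i++)
                if (!memcmp(h + i, needle, nl))
                        return (void *)(h + i);
        return NULL;
}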
I don't think anybody involved with glibc really feels like having
strstr() (or memmem(), for that matter) go fast. (The grounds I heard
were that "applications requiring large searches are more likely to
have their own fast string-search implementations.")
> No sooner thought than done: the code is attached to this mail. The
> algorithm is similar to the Rabin/Karp method for string searches, but
> uses a weaker and simpler additive rolling hash.
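
If I follow, the search then looks roughly like this (my sketch of the
idea, not the attached code):

#include <stddef.h>
#include <string.h>

/* memmem() with an additive rolling hash: the hash of a window is the
 * plain sum of its bytes, updated in O(1) per shift; memcmp() runs
 * only on a hash match. */
void *rolling_memmem(const void *haystack, size_t hl,
                     const void *needle, size_t nl)
{
        const unsigned char *h = haystack, *n = needle;
        size_t nh = 0, hh = 0, i;

        if (nl == 0)
                return (void *)h;
        if (hl < nl)
                return NULL;
        for (i = 0; i < nl; i++) {
                nh += n[i];                     /* needle hash */
                hh += h[i];                     /* first window */
        }
        for (i = 0; ; i++) {
                if (hh == nh && !memcmp(h + i, n, nl))
                        return (void *)(h + i);
                if (i == hl - nl)
                        return NULL;            /* no match anywhere */
                hh += h[i + nl] - h[i];         /* roll the window */
        }
}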
There are rolling hashes based on arithmetic in fields of order 2^k.
These can typically be computed using a bunch of bit-shifts and additions;
have you tried using one? There are lots of irreducible polynomials over
Z_2 of degree k, so you can even fall back to a different one every few
false positives.
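
Concretely, such a hash rolls with nothing but shifts and XORs. A
sketch; the 32-bit width, the 256-entry table, and the particular
polynomial are my own picks (any irreducible of degree 32 would do):

#include <stddef.h>
#include <stdint.h>

/* Rolling hash in GF(2^32): the window's bytes are read as one long
 * polynomial over Z_2 and reduced mod an irreducible p(t). */
#define POLY UINT32_C(0x00400007)       /* x^32+x^22+x^2+x+1; the x^32
                                           coefficient is implicit */

static uint32_t shift_byte(uint32_t h)  /* h(t) * t^8 mod p(t) */
{
        int i;

        for (i = 0; i < 8; i++)
                h = (h << 1) ^ ((h & UINT32_C(0x80000000)) ? POLY : 0);
        return h;
}

/* out[c] = c(t) * t^(8m) mod p(t); filled once per needle length m
 * (slow but simple precomputation). */
static void init_out(uint32_t out[256], size_t m)
{
        unsigned c;
        size_t i;

        for (c = 0; c < 256; c++) {
                uint32_t h = c;

                for (i = 0; i < m; i++)
                        h = shift_byte(h);
                out[c] = h;
        }
}

/* Slide the window: append `in', drop the oldest byte `old'. */
static uint32_t roll(uint32_t h, unsigned char in, unsigned char old,
                     const uint32_t out[256])
{
        return shift_byte(h) ^ in ^ out[old];
}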
> The result is still somewhat like O(M+N) for most cases
I don't think you get linear performance in the average case. (But I do
think you shave a factor of 256 off of the quadratic term. The same
algorithm with the population count as its hash function gives a
collision between two random strings of length m with probability
sum_{i=0}^{m} (m choose i)^2 / 4^m = (2m choose m) / 4^m, which decays
only like 1/sqrt(pi*m). Your byte-wise sum improves on that by a factor
of 256.)
> (There may be corner cases where it's not that good, but it's really
> hard to imagine those.)
The needle "1100110011001100...1100" and the haystack
"10101010101010...10" will produce quite a few false matches with your
hash function (simply because every substring of the haystack of the
appropriate length has the same hash as your needle). (Making the needle
"1010101010...101100" makes it even worse.)
> Anyways, it is always faster than the current implementation,
Try the example above. (Your code also needs a "return NULL;" at the end,
by the way.)
> In my tests with common data (parsing multipart/form-data CGI
> responses) it yields approx. 20%-100% improvements, and with a little
> artificial cheating (strings in the haystack which differ only at the
> last char of the needle) the improvements are much bigger (500%, since
> it is in another complexity class). The fact is that the old
> implementation does a brute-force search, which has corner cases that
> are quite easy to hit (repeating symbol sequences, a small set of
> possible symbols), while for my implementation the corner cases are
> much rarer, and even then it will still perform better than the old
> implementation.
The old implementation is about 50% faster than yours in the example
above.
> The code is not much more complex than the old one. It is not the
> original 'Rabin/Karp' algorithm, because that would require an
> efficient modulo operation with a prime number, which might be slow on
> some platforms (just a guess, I didn't even try; the performance
> satisfies me perfectly).
Karp-Rabin with hashing done mod a prime is an impractical string matching
algorithm.
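
For reference, the per-byte update being avoided there looks like this
(a sketch; the prime and the base are my picks):

#include <stdint.h>

/* Textbook Karp-Rabin: the window is read as a base-256 number mod a
 * prime Q. */
#define Q UINT64_C(2147483647)          /* the Mersenne prime 2^31 - 1 */
#define B UINT64_C(256)

/* Slide the window one byte: drop `old', append `in'.
 * bm1 = B^(m-1) mod Q, precomputed once for needle length m. */
static uint64_t kr_roll(uint64_t h, unsigned char old, unsigned char in,
                        uint64_t bm1)
{
        h = (h + Q - old * bm1 % Q) % Q;        /* remove old byte */
        return (h * B + in) % Q;                /* shift in the new one */
}

Those two modulo reductions per input byte are where the cost sits.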
> Other search algorithms like 'Knuth/Morris/Pratt'
A KMP-like algorithm can be implemented in a way that uses constant
additional space. For unordered alphabets it's called "Galil-Seiferas";
for ordered alphabets the analogous algorithm is to decompose (in
linear time and constant space) the needle into its lexicographically
largest suffix and some prefix. Then you look for the suffix. Wherever
the suffix appears, and appears at least prefix-length later than its
previous occurrence, you check for the prefix.
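
The decomposition step itself is short. A sketch of the constant-space
scan that finds where the lexicographically largest suffix starts (just
this step, not a complete matcher):

#include <stddef.h>

/* Index at which the lexicographically largest suffix of x[0..n-1]
 * begins; linear time, constant extra space. */
static size_t max_suffix(const unsigned char *x, size_t n)
{
        size_t i = 0;   /* start of the best candidate suffix so far */
        size_t j = 1;   /* start of the challenger */
        size_t k = 0;   /* prefix length matched between the two */

        while (j + k < n) {
                if (x[i + k] == x[j + k]) {
                        k++;
                } else if (x[i + k] > x[j + k]) {
                        j += k + 1;     /* challenger loses outright */
                        k = 0;
                } else {
                        /* challenger wins; restart just past it */
                        i = (i + k + 1 > j) ? i + k + 1 : j;
                        j = i + 1;
                        k = 0;
                }
        }
        return i;
}

For example, on "banana" it returns 2, the start of "nana".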
> or 'Boyer/Moore' are more complex to implement and require O(needle)
> extra memory
There are variants of both that require constant additional space. (They
also perform way better than anything else I've seen.)
> in common implementations, which reserves them for special purposes,
> imo. I just wanted to keep it simple but still considerably better
> than the current one.
As it happens, the extra space consumed by the KMP and suffix shift tables
(and the associated bad cache effects) make the usual KMP and Boyer-Moore
way, way slower than naive string matching on random inputs.