Re: Fast search

Evan Priestley <spam@xxxxxxxx> · Wed, 17 May 2006 17:27:39 -0400

You can make a pretty effective time/memory tradeoff -- particularly  
if your array of patterns is relatively fixed -- but unless you re- 
implement this in C, it will probably be slower than a brute force  
approach with the str* functions unless you are searching for a  
fairly huge number of needles.

Below, I build a tree of pattern matches, which you navigate branch- 
by-branch by using successive letters of the pattern. So, after the  
tree is built, the index $tree[ 'h' ][ 'a' ][ 't' ] exists and has  
the value `true'. To search the string, you push cursors onto a  
stack, advance them as you advance through the string, and destroy  
them when they hit a dead end.

This has a worst case of O( N^2 ) (on the length of the string),  
which happens when you search for patterns like "hhhhhhhhhhhhhhhha"  
and "hhhhhhhhhhhhhhhhb" in the string "hhhhhhhhhhhhhhhhh". The  
average case for random or typical strings should be very close to O 
( N ). Critically, search time is a function only of the length of  
the string, not of the number of patterns. Memory usage is also only  
a function of the size of the character set and the length of the  
longest pattern, not of the total number of patterns.

This particular implementation has some nice side-effects; uncomment  
the var_dump() line and note how 'pirate' and 'firm' have been  
optimized out of the tree in favor of 'pir' and 'fir'.

Evan

<?php

    $pattern = array(
         'hat'
        ,'hut'
        ,'blue dog'
        ,'blue cat'
        ,'pirate'
        ,'pir'
        ,'fir'
        ,'firm'
        );

    $tree = array( );
    foreach( $pattern as $str ) {
        $len = strlen( $str );
        $cursor =& $tree;
        for( $ii = 0; $ii < $len; $ii++ ) {
            if( !isset( $cursor[ $str[ $ii ] ] ) ) {
                $cursor[ $str[ $ii ] ] = array( );
                }
            $cursor =& $cursor[ $str[ $ii ] ];
            if( $cursor === true ) break;
            }
        $cursor = true;
        }

/*  var_dump 
( $tree );                                                  */

    $corpus = 'lorem ipsum dolor sur amet pi blue do he hi h fipir  
slog';
    $clen   = strlen( $corpus );

    $stack  = array( );
    for( $ii = 0; $ii < $clen; $ii++ ) {
        $c = $corpus[ $ii ];
        foreach( $stack as $k => $_ ) {
            if( isset( $stack[ $k ][ $c ] ) ) {
                if( $stack[ $k ][ $c ] === true ) {
                    die( 'Matched at position '.$ii );
                    }
                $stack[ $k ] =& $stack[ $k ][ $c ];
                }
            else {
                unset( $stack[ $k ] );
                }
            }
        if( isset( $tree[ $c ] ) ) {
            $stack[] =& $tree[ $c ];
            }
        }
    die( 'Did not find a match.' );

?>

On May 17, 2006, at 12:37 PM, René Fournier wrote:

Looking for suggestions on the most compute-efficient way to search  
variable-length strings (~200 characters) for the occurrence of one  
of about 100 possible needles. In other words:

$needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 =>  
"Red Car", ... etc .... (100 elements)

$haystack = "Once upon a time there was a programming language that  
everyone loved. Its name was PHP, and the people that worked with  
it also liked Yellow Bananas. One day...";

Now perhaps I'm blind, but I can't see an obvious string or array  
function that simply allows you to use an array as a needle while  
searching a string haystack. What I'm really looking for is  
something like the reverse of in_array.

The obvious thing would be to loop through the needles and run  
substr_count on the haystack... But is there a faster way?

...Rene

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php