Re: Fast search

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



You can make a pretty effective time/memory tradeoff -- particularly if your array of patterns is relatively fixed -- but unless you re- implement this in C, it will probably be slower than a brute force approach with the str* functions unless you are searching for a fairly huge number of needles.

Below, I build a tree of pattern matches, which you navigate branch- by-branch by using successive letters of the pattern. So, after the tree is built, the index $tree[ 'h' ][ 'a' ][ 't' ] exists and has the value `true'. To search the string, you push cursors onto a stack, advance them as you advance through the string, and destroy them when they hit a dead end.

This has a worst case of O( N^2 ) (on the length of the string), which happens when you search for patterns like "hhhhhhhhhhhhhhhha" and "hhhhhhhhhhhhhhhhb" in the string "hhhhhhhhhhhhhhhhh". The average case for random or typical strings should be very close to O ( N ). Critically, search time is a function only of the length of the string, not of the number of patterns. Memory usage is also only a function of the size of the character set and the length of the longest pattern, not of the total number of patterns.

This particular implementation has some nice side-effects; uncomment the var_dump() line and note how 'pirate' and 'firm' have been optimized out of the tree in favor of 'pir' and 'fir'.

Evan

<?php

    $pattern = array(
         'hat'
        ,'hut'
        ,'blue dog'
        ,'blue cat'
        ,'pirate'
        ,'pir'
        ,'fir'
        ,'firm'
        );

    $tree = array( );
    foreach( $pattern as $str ) {
        $len = strlen( $str );
        $cursor =& $tree;
        for( $ii = 0; $ii < $len; $ii++ ) {
            if( !isset( $cursor[ $str[ $ii ] ] ) ) {
                $cursor[ $str[ $ii ] ] = array( );
                }
            $cursor =& $cursor[ $str[ $ii ] ];
            if( $cursor === true ) break;
            }
        $cursor = true;
        }

/* var_dump ( $tree ); */

$corpus = 'lorem ipsum dolor sur amet pi blue do he hi h fipir slog';
    $clen   = strlen( $corpus );

    $stack  = array( );
    for( $ii = 0; $ii < $clen; $ii++ ) {
        $c = $corpus[ $ii ];
        foreach( $stack as $k => $_ ) {
            if( isset( $stack[ $k ][ $c ] ) ) {
                if( $stack[ $k ][ $c ] === true ) {
                    die( 'Matched at position '.$ii );
                    }
                $stack[ $k ] =& $stack[ $k ][ $c ];
                }
            else {
                unset( $stack[ $k ] );
                }
            }
        if( isset( $tree[ $c ] ) ) {
            $stack[] =& $tree[ $c ];
            }
        }
    die( 'Did not find a match.' );

?>



On May 17, 2006, at 12:37 PM, René Fournier wrote:

Looking for suggestions on the most compute-efficient way to search variable-length strings (~200 characters) for the occurrence of one of about 100 possible needles. In other words:



$needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 => "Red Car", ... etc .... (100 elements)

$haystack = "Once upon a time there was a programming language that everyone loved. Its name was PHP, and the people that worked with it also liked Yellow Bananas. One day...";



Now perhaps I'm blind, but I can't see an obvious string or array function that simply allows you to use an array as a needle while searching a string haystack. What I'm really looking for is something like the reverse of in_array.

The obvious thing would be to loop through the needles and run substr_count on the haystack... But is there a faster way?

...Rene

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[Index of Archives]     [PHP Home]     [Apache Users]     [PHP on Windows]     [Kernel Newbies]     [PHP Install]     [PHP Classes]     [Pear]     [Postgresql]     [Postgresql PHP]     [PHP on Windows]     [PHP Database Programming]     [PHP SOAP]

  Powered by Linux