You can make a pretty effective time/memory tradeoff -- particularly
if your array of patterns is relatively fixed -- but unless you re-
implement this in C, it will probably be slower than a brute force
approach with the str* functions unless you are searching for a
fairly huge number of needles.
Below, I build a tree of pattern matches, which you navigate branch-
by-branch by using successive letters of the pattern. So, after the
tree is built, the index $tree[ 'h' ][ 'a' ][ 't' ] exists and has
the value `true'. To search the string, you push cursors onto a
stack, advance them as you advance through the string, and destroy
them when they hit a dead end.
This has a worst case of O( N^2 ) (on the length of the string),
which happens when you search for patterns like "hhhhhhhhhhhhhhhha"
and "hhhhhhhhhhhhhhhhb" in the string "hhhhhhhhhhhhhhhhh". The
average case for random or typical strings should be very close to O
( N ). Critically, search time is a function only of the length of
the string, not of the number of patterns. Memory usage is also only
a function of the size of the character set and the length of the
longest pattern, not of the total number of patterns.
This particular implementation has some nice side-effects; uncomment
the var_dump() line and note how 'pirate' and 'firm' have been
optimized out of the tree in favor of 'pir' and 'fir'.
Evan
<?php
$pattern = array(
'hat'
,'hut'
,'blue dog'
,'blue cat'
,'pirate'
,'pir'
,'fir'
,'firm'
);
$tree = array( );
foreach( $pattern as $str ) {
$len = strlen( $str );
$cursor =& $tree;
for( $ii = 0; $ii < $len; $ii++ ) {
if( !isset( $cursor[ $str[ $ii ] ] ) ) {
$cursor[ $str[ $ii ] ] = array( );
}
$cursor =& $cursor[ $str[ $ii ] ];
if( $cursor === true ) break;
}
$cursor = true;
}
/* var_dump
( $tree ); */
$corpus = 'lorem ipsum dolor sur amet pi blue do he hi h fipir
slog';
$clen = strlen( $corpus );
$stack = array( );
for( $ii = 0; $ii < $clen; $ii++ ) {
$c = $corpus[ $ii ];
foreach( $stack as $k => $_ ) {
if( isset( $stack[ $k ][ $c ] ) ) {
if( $stack[ $k ][ $c ] === true ) {
die( 'Matched at position '.$ii );
}
$stack[ $k ] =& $stack[ $k ][ $c ];
}
else {
unset( $stack[ $k ] );
}
}
if( isset( $tree[ $c ] ) ) {
$stack[] =& $tree[ $c ];
}
}
die( 'Did not find a match.' );
?>
On May 17, 2006, at 12:37 PM, René Fournier wrote:
Looking for suggestions on the most compute-efficient way to search
variable-length strings (~200 characters) for the occurrence of one
of about 100 possible needles. In other words:
$needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 =>
"Red Car", ... etc .... (100 elements)
$haystack = "Once upon a time there was a programming language that
everyone loved. Its name was PHP, and the people that worked with
it also liked Yellow Bananas. One day...";
Now perhaps I'm blind, but I can't see an obvious string or array
function that simply allows you to use an array as a needle while
searching a string haystack. What I'm really looking for is
something like the reverse of in_array.
The obvious thing would be to loop through the needles and run
substr_count on the haystack... But is there a faster way?
...Rene
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php