Winfried Meining wrote:
> I am writing a script that parses a PHP script and finds all function calls
> to check whether these functions exist. To do this, I needed a function that
> strips out all text enclosed in apostrophes or quotation marks. This
> is somewhat tricky, as the script needs to be aware of what really is
> enclosed text and what is PHP code. Apostrophes in text enclosed in quotation
> marks should be ignored, and quotation marks in text enclosed in apostrophes
> should be ignored as well. Similarly, escaped apostrophes in apostrophe-enclosed
> text and escaped quotation marks in quotation-mark-enclosed text should be ignored.
> The following function uses preg_match to do this job.
> [···]
> I noticed that this function is very slow, in particular because
> preg_match("/^(.*)some_string(.*)$/", $text, $matches);
> always seems to find the *last* occurrence of some_string and not the *first*
> (I would need the first). There is certainly a way to write another version
> that looks at every single character while going through $text, but this
> would make the code much more complex.
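IIRC, the regexp engine works through the pattern and the string from left
to right, but a greedy quantifier first consumes as much as it can and then
gives characters back one at a time whenever the next part of the pattern
fails to match (backtracking), and so on for the whole expression. That is
what "(?>...)", the "once-only subpatterns", exist for: they stop the engine
from backtracking into a subpattern once it has matched.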
Now, you're telling it (from left to right) to look for "any sequence of
any characters, followed by 'some_string', followed once again by
any sequence of any characters". The first "(.*)" runs through the whole
string until it can match no more; only then does the engine try to match
"some_string". If that fails, it gives back one character and tries
"some_string" again from the current position minus one, and so on, until
it matches (or "(.*)" has backed up all the way to the beginning of the
string). Because it backs up from the end, the first match it finds is the
last occurrence. After "some_string" matches, the same steps are repeated
for the second "(.*)" in the pattern, so the process is quite slow.
--I hope this makes some sense to you (somehow I lost track of it myself)
Also, regexps are "greedy" by default, which means the "+" and "*"
meta-characters match as much as they can. In your case, you will most
likely need to limit this behaviour by making them "ungreedy" (that is,
the engine tries to match the next part of the pattern after each step of
"+"/"*"). You can do this by adding a "?" after these meta-characters
(e.g. ".+?-").
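To see the difference, here is a minimal sketch. The sample text and the variable names are made up for illustration; only the pattern comes from your message, once as written and once with an ungreedy first group.

```php
<?php
// Hypothetical sample text containing the marker twice.
$text = 'aaa some_string bbb some_string ccc';

// Greedy: the first (.*) grabs as much as it can and backs up from the
// end, so the match settles on the LAST occurrence of some_string.
preg_match('/^(.*)some_string(.*)$/', $text, $greedy);

// Ungreedy: (.*?) grows one character at a time, so the match settles
// on the FIRST occurrence.
preg_match('/^(.*?)some_string(.*)$/', $text, $lazy);

echo '[', $greedy[1], "]\n"; // [aaa some_string bbb ]
echo '[', $lazy[1], "]\n";   // [aaa ]
echo '[', $lazy[2], "]\n";   // [ bbb some_string ccc]

// For a fixed string, strpos() also gives the first occurrence directly.
$pos = strpos($text, 'some_string'); // 4
```

If the thing you are looking for is a plain string rather than a pattern, the strpos() variant is probably the simplest option.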
> I wonder if there is a faster *and* simple way to do the same thing.
Mhh... what about something like
preg_replace('/(["\']).*?(?<!\\\)\\1/X', '', $code)
? It's not 100% accurate, though: if it finds something like '\\' it
will fail, because the closing quote is preceded by a backslash that does
not actually escape it. I guess you should replace these before running
the regexp.
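To make the '\\' problem concrete, here is a minimal sketch. The sample input and the variable names are made up for illustration, and the second pattern is a different technique (an alternation that consumes backslash-escaped pairs), not the regexp above. Both patterns are written as nowdoc strings so the backslashes need no extra escaping, and the X modifier is left out.

```php
<?php
// Hypothetical sample: PHP source with a string ending in an escaped backslash.
$code = <<<'SRC'
$a = 'C:\\'; foo('x');
SRC;

// The pattern from above (minus the X modifier): a closing quote is
// accepted only if the character before it is not a backslash.
$naivePattern = <<<'RE'
/(["']).*?(?<!\\)\1/
RE;

// Alternative: inside the quotes, consume either a character that is
// neither the quote nor a backslash, or a backslash plus the character
// that follows it.
$escapeAwarePattern = <<<'RE'
/'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"/s
RE;

$naive  = preg_replace($naivePattern, '', $code);
$strict = preg_replace($escapeAwarePattern, '', $code);

echo $naive, "\n";  // $a = x');
echo $strict, "\n"; // $a = ; foo();
```

With the first pattern the match runs on to the opening quote of 'x', so the call to foo() is swallowed. The second one keeps it, though it still knows nothing about comments or heredocs.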
After trying a little, I found that the code below seems to be quite
acceptable; you may want to try it:
/**
 * Returns an array with the identified function calls found
 * (including "function declarations", e.g. "function my_func"),
 * or FALSE if none are found
 *
 * @param string $code
 * @return array|false
 * @since Mon Apr 10 01:13:28 CDT 2006
 * @author rsalazar
 */
function getFunctionCalls( $code ) {
    $result = FALSE;
    // try to strip away literal strings
    $code = preg_replace('/(["\']).*?(?<!\\\)\\1/X', '', $code);
    // look for "function calls": a word, optionally preceded by the
    // "function" keyword, followed by an opening parenthesis
    if ( preg_match_all('/(?>((?:(?<=\b)function\s+)?\w+)\s*)\(/Xi',
                        $code, $arr_matches) ) {
        $result = $arr_matches[1];
    }
    return $result;
} // getFunctionCalls()
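Note that the second pattern also picks up language constructs such as "if (", so the result needs some filtering before it can be checked with function_exists(). A rough sketch of that step follows. The sample input, the list of constructs (certainly incomplete) and the variable names are my own assumptions, and "\b" is used in place of "(?<=\b)", which should be equivalent.

```php
<?php
// Hypothetical input; string literals are assumed to be stripped already.
$code = 'function my_func($a) { if (strlen($a)) { return no_such_fn_demo(trim($a)); } }';

// Same idea as the call-matching pattern in getFunctionCalls(),
// without the X modifier.
preg_match_all('/(?>((?:\bfunction\s+)?\w+)\s*)\(/i', $code, $m);
$names = $m[1];
// $names: 'function my_func', 'if', 'strlen', 'no_such_fn_demo', 'trim'

// Assumed (incomplete) list of constructs that look like calls but are not.
$constructs = array('if', 'elseif', 'while', 'for', 'foreach', 'switch',
                    'catch', 'array', 'list', 'isset', 'unset', 'empty',
                    'function');

$missing = array();
foreach ($names as $name) {
    if (preg_match('/^function\s/i', $name)) {
        continue; // a declaration, not a call
    }
    if (in_array(strtolower($name), $constructs, true)) {
        continue; // a language construct
    }
    if (!function_exists($name)) {
        $missing[] = $name; // called but not defined
    }
}

print_r($missing); // only no_such_fn_demo is left
```

Functions declared in the parsed script itself are not defined while the checker runs, so in practice the names collected from the "function ..." entries would have to be treated as existing too.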
I recommend reading:
-> http://php.net/pcre
-> http://php.net/manual/en/reference.pcre.pattern.syntax.php
-> http://php.net/manual/en/reference.pcre.pattern.modifiers.php
--
Sincerely,
J. Rafael Salazar Magaña
Innox - Innovación Inteligente
Tel: +52 (33) 3615 5348 ext. 205 / 01 800 2-SOFTWARE
http://www.innox.com.mx