Re: stripping enclosed text from PHP code

Rafael <rsalazar@xxxxxxxxxxxx> · Mon, 10 Apr 2006 02:06:47 -0500

Winfried Meining wrote:
I am writing a script that parses a PHP script and finds all function calls, 
to check whether these functions exist. To do this, I needed a function that 
would strip out all text that is enclosed in apostrophes or quotation marks. 
This is somewhat tricky, as the script needs to be aware of what really is 
enclosed text and what is PHP code. Apostrophes in text enclosed in quotation 
marks should be ignored, and so should quotation marks in text enclosed in 
apostrophes. Similarly, escaped apostrophes in apostrophe-enclosed text and 
escaped quotation marks in quotation-mark-enclosed text should be ignored. 

The following function uses preg_match to do this job. 
[···]
I noticed that this function is very slow, in particular because 

preg_match("/^(.*)some_string(.*)$/", $text, $matches);

always seems to find the *last* occurrence of some_string and not the *first* 
(I would need the first). There is certainly a way to write another version 
where one looks at every single character when going through $text, but this 
would make the code much more complex.

	IIRC, the regexp engine scans the subject from left to right, and 
when a later part of the pattern fails to match it backtracks: it steps 
back into the part that already matched, gives up one character, and 
tries again, and so on for the whole expression. That is what "(?>...)", 
the "once-only subpatterns", exist for: they stop the engine from 
backtracking into that part of the pattern.
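
	A minimal sketch of what a once-only subpattern changes, using a 
made-up subject string ("aaab" and both patterns are only for 
illustration, they are not from your script):

```php
<?php
// Ordinary greedy quantifier: "a+" first takes "aaa", then gives one "a"
// back so that the trailing "ab" can still match.
$backtracking = preg_match('/a+ab/', 'aaab');

// Once-only (atomic) group: "(?>a+)" takes every "a" it can reach and
// never gives any of them back, so the trailing "ab" cannot match.
$atomic = preg_match('/(?>a+)ab/', 'aaab');

var_dump($backtracking); // int(1)
var_dump($atomic);       // int(0)
```

	The atomic version fails faster because the engine does not retry 
shorter runs of "a", but it also rejects a subject the plain version 
accepts, so it only helps where backtracking could never succeed anyway.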

	Now, you're telling it (from left to right) to look for "any sequence 
of any characters, followed by 'some_string', followed once again by any 
sequence of any characters".  Since "(.*)" matches anything, it first 
runs to the end of the string.  Only then does the engine try to match 
"some_string"; when that fails, it gives back one character from "(.*)" 
and tries "some_string" again from that position, and so on, until it 
matches (or "(.*)" has nothing left to give back, at the beginning of 
the string).  That is why you get the *last* occurrence.  After 
"some_string" matches, the same thing happens for the second "(.*)" in 
the pattern, so the whole process will be quite slow.
--I hope this makes some sense to you (somehow I lost it myself)

	Also, by default regexps are "greedy", which means the "+" and "*" 
meta-characters match as much as they can.  In your case, you will most 
likely need to limit this behaviour by making them "ungreedy" (that is, 
after each character that "+"/"*" takes, the engine tries to match the 
next part of the pattern).  You can do this by adding a "?" after these 
meta-characters (e.g. ".+?-").
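
	To see the difference on a small subject, here is a sketch (the 
contents of $text are made up for illustration):

```php
<?php
$text = 'aa some_string bb some_string cc';

// Greedy: the first (.*) grabs as much as it can, so the match ends up
// anchored on the LAST "some_string".
preg_match('/^(.*)some_string(.*)$/', $text, $greedy);

// Ungreedy: (.*?) grows one character at a time, so the match is
// anchored on the FIRST "some_string".
preg_match('/^(.*?)some_string(.*)$/', $text, $lazy);

// For a fixed string, strpos() finds the first occurrence directly.
$pos = strpos($text, 'some_string');

var_dump($greedy[1]); // "aa some_string bb "
var_dump($lazy[1]);   // "aa "
var_dump($pos);       // int(3)
```

	When the thing you are looking for is a fixed string, as it is here, 
strpos() avoids the regexp engine altogether.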

I wonder if there is a faster *and* simple way to do the same thing. 

	Mhh...  what about something like
  preg_replace('/(["\']).*?(?<!\\\)\\1/X', '', $code)
? It's not 100% accurate, though: on a string that ends in an escaped 
backslash, like '\\', it will fail, because the closing quote then looks 
escaped.  I guess you should replace these before running the regexp.
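
	One possible way around the '\\' case without a separate replace 
step is to consume escape pairs explicitly inside the quotes.  This is 
only a sketch: stripStringLiterals() is a hypothetical helper name, and 
it assumes the code contains no heredocs and no quote characters inside 
comments, which it knows nothing about.

```php
<?php
// Hypothetical helper, not part of the original function.
function stripStringLiterals($code) {
    // Inside each quoted run accept either a backslash followed by any
    // character (an escape pair) or any character that is neither the
    // closing quote nor a backslash. "s" lets "." match newlines too.
    $pattern = '/"(?:\\\\.|[^"\\\\])*"|\'(?:\\\\.|[^\'\\\\])*\'/s';
    return preg_replace($pattern, '', $code);
}

// Made-up sample, written as a nowdoc so no extra escaping is needed.
$code = <<<'PHP'
foo('a\\', "it's", 'say "hi"'); bar("x\"y");
PHP;

var_dump(stripStringLiterals($code)); // "foo(, , ); bar();"
```

	Because a backslash is always consumed together with the character 
after it, a literal that ends in '\\' closes where it should, and no 
lookbehind is needed.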

	After trying a little, I found that the code below seems to work 
acceptably; you may want to try it:
  /**
   * Returns an array with the identified function calls found
   * (including "function declarations", e.g. "function my_func"),
   * or FALSE if none were found
   *
   * @param     string  $code
   * @return    array|false
   * @since     Mon Apr 10 01:13:28 CDT 2006
   * @author    rsalazar
   */
  function getFunctionCalls( $code ) {
      $result = FALSE;
      // try to strip away literal strings
      $code   = preg_replace('/(["\']).*?(?<!\\\)\\1/X', '', $code);
      // look for "function calls"
      if ( preg_match_all('/(?>((?:(?<=\b)function\s+)?\w+)\s*)\(/Xi',
                          $code, $arr_matches) ) {
          $result = $arr_matches[1];
      }
      return  $result;
  } // getFunctionCalls()
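
	A quick way to try the function above on a made-up snippet ($sample 
is purely illustrative; the function body is copied unchanged so that 
the block runs on its own):

```php
<?php
// Copied from the message above so that this block is self-contained.
function getFunctionCalls( $code ) {
    $result = FALSE;
    // try to strip away literal strings
    $code   = preg_replace('/(["\']).*?(?<!\\\)\\1/X', '', $code);
    // look for "function calls"
    if ( preg_match_all('/(?>((?:(?<=\b)function\s+)?\w+)\s*)\(/Xi',
                        $code, $arr_matches) ) {
        $result = $arr_matches[1];
    }
    return  $result;
}

// Made-up input; the "(" inside the string literal must not be counted.
$sample = <<<'PHP'
function my_func($a) {
    echo "call(me)";
    return strlen(trim($a));
}
PHP;

$calls = getFunctionCalls($sample);
print_r($calls); // 'function my_func', 'strlen', 'trim'
```

	Note that anything word-like followed by "(" is reported, so 
control structures such as "if (" or "while (" will show up in the 
result as well and need to be filtered out afterwards.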

I recommend you read:
-> http://php.net/pcre
 > http://php.net/manual/en/reference.pcre.pattern.syntax.php
 > http://php.net/manual/en/reference.pcre.pattern.modifiers.php
--
Sincerely,
J. Rafael Salazar Magaña
Innox - Innovación Inteligente
Tel: +52 (33) 3615 5348 ext. 205 / 01 800 2-SOFTWARE
http://www.innox.com.mx
