Re: Parsing

> So, what does (.*?) mean? Well, simply said "any character, occurring 0 or more times" occurring 0 or 1 times.

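I don't think so.  ((.*)?) would mean that, but in (.*?) the '?' means "make the preceding quantifier non-greedy"; that is, make it match the minimum number of times that still lets the whole pattern succeed. As the minimum number of matches of (.*) is zero, (.*?) taken on its own means "match no character at all", so on its own it will always be true, wherever it occurs in a match string. Inside a larger pattern it starts from that empty match and is extended only as far as the rest of the pattern requires.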
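As a rough sketch of that distinction (the subject string and variable names here are only illustrative), the two forms can be put side by side in front of a literal 'b', along with (.*?) entirely on its own:

```php
<?php
$subject = 'aabbcc';

// ((.*)?) : an optional group wrapped around a greedy .*
preg_match('/((.*)?)b/', $subject, $optional);

// (.*?) : a non-greedy .*
preg_match('/(.*?)b/', $subject, $lazy);

// (.*?) with nothing after it: the minimum match, no characters at all
$found = preg_match('/(.*?)/', $subject, $alone);

var_dump($optional[1]); // greedy: runs up to the last 'b'
var_dump($lazy[1]);     // non-greedy: stops before the first 'b'
var_dump($found, $alone[0]);
```

The first capture should come out as 'aab', the second as 'aa', and the third call should succeed with an empty match.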

For instance,

$ php -a
Interactive shell

php > $test = "aabbcc";
php > $re = '/.+?(bb?).*/';
php > preg_match($re, $test, $match);
php > print_r($match);
Array
(
    [0] => aabbcc
    [1] => bb
)
  Note here that the initial pattern piece '.+?' is limited to the minimum match.
  The minimum match is a single character, but it has to be extended so that the
  capturing sub-expression '(bb?)' can match, so it in fact matches 'aa'.  Note that in
  this regexp, (bb?) means "...a 'b' char followed by zero or one 'b' chars."
  Now change that initial sub-expression.
php > $re = '/.*?(bb?).*/';
php > preg_match($re, $test, $match);
php > print_r($match);
Array
(
    [0] => aabbcc
    [1] => bb
)
  Here the initial '.*?' is constrained to the minimum, and its minimum match is
  no characters at all.  But again, the minimum match must be extended to
  accommodate '(bb?)'.
  Now remove the minimising constraint.
php > $re = '/.*(bb?).*/';
php > preg_match($re, $test, $match);
php > print_r($match);
Array
(
    [0] => aabbcc
    [1] => b
)
  Only one 'b'!  Which 'b' is matched?  It's the second 'b'.  The minimum match
  for (bb?) is a single 'b' followed by zero 'b's; so the second 'b' satisfies
  the capturing expression, and the now-greedy initial sub-expression can
  gobble up all of the characters up to that second 'b'.

  Don't believe me?
php > $re = '/(.*)(bb?).*/';
php > preg_match($re, $test, $match);
php > print_r($match);
Array
(
    [0] => aabbcc
    [1] => aab
    [2] => b
)
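  Another way to check which 'b' was captured, sketched here on the assumption
  that a byte offset is enough evidence, is to pass the PREG_OFFSET_CAPTURE flag
  so that preg_match() reports where each group starts:

```php
<?php
$test = 'aabbcc';

// With PREG_OFFSET_CAPTURE each entry of $match is a [text, offset] pair.
preg_match('/.*(bb?).*/', $test, $match, PREG_OFFSET_CAPTURE);

list($text, $offset) = $match[1];
echo "captured '$text' at offset $offset\n";
```

  Offsets count from zero, so the first 'b' sits at offset 2 and the second at
  offset 3; the capture should be reported at offset 3.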
  Let's back up.
php > $re = '/(.*?)(bb?).*/';
php > preg_match($re, $test, $match);
php > print_r($match);
Array
(
    [0] => aabbcc
    [1] => aa
    [2] => bb
)
  As before, with a non-greedy initial sub-expression,
  except that we now capture that initial sub-expression.
  (bb?) means "...a 'b' followed by zero or one 'b's", greedily.
  Can we force that to be non-greedy?
php > $re = '/(.*?)(bb??)(.*)/';
php > preg_match($re, $test, $match);
php > print_r($match);
Array
(
    [0] => aabbcc
    [1] => aa
    [2] => b
    [3] => bcc
)
  Yes we can, by appending a moderating '?' which curbs the
  appetite of the capturing sub-expression: (bb??)
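
The quoted claim that (.*?), (.*) and (.+?) are equal can be tested in the same way. This is only a sketch: the subject 'bbbcc' is a made-up string, chosen because it begins with several 'b's and so lets all three prefixes behave differently.

```php
<?php
// Illustrative subject, not taken from the session above.
$subject = 'bbbcc';
$results = [];

foreach (['.*?', '.*', '.+?'] as $prefix) {
    // Same tail every time; only the leading sub-expression changes.
    preg_match('/(' . $prefix . ')(bb?)/', $subject, $m);
    $results[$prefix] = [$m[1], $m[2]];
    printf("%-4s => [1] '%s'  [2] '%s'\n", $prefix, $m[1], $m[2]);
}
```

The non-greedy '.*?' should capture nothing and leave 'bb' for the second group, the greedy '.*' should take 'bb' and leave a single 'b', and '.+?' should take exactly one 'b' and leave 'bb'.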

Peter West
"...and behold, something greater than Jonah is here."

> On 23 Feb 2015, at 10:48 pm, Maciek Sokolewicz <maciek.sokolewicz@xxxxxxxxx> wrote:
> 
> Secondly, the above two regexp rules are slightly bloated. What they actually mean is:
> ( = start new catchable pattern
> . = any character
> * = 0 or more of the previous pattern
> ? = 0 or 1 of the previous pattern
> ) = end catchable pattern
> \s = any whitespace character
> 
> So, what does (.*?) mean? Well, simply said "any character, occurring 0 or more times" occurring 0 or 1 times. But since the any character pattern already occurs 0 or more times, the pattern as a whole will either be matched (1 time) or not (0 times). Making the ? metacharacter useless. Now if it were (.+?) then it would state "one or more of any character, with the entire pattern optional".
> So in practice, the following patterns are equal in what they represent: (.*?), (.*), (.+?)
> 

