Re: Adding text before last paragraph

"Dotan Cohen" <dotancohen@xxxxxxxxx> · Wed, 29 Aug 2007 02:52:13 +0300

On 28/08/07, Brian Rue <brianrue@xxxxxxxxx> wrote:
> Sure, I'll break it apart a little:

Er, wow, thanks. Lots of material here...

> '{(?=<p(?:>|\s)(?!.*<p(?:>|\s)))}is'
>
> $regex = '{' .     // opening delimiter
>          '(?=' .   // positive lookahead: match at a position that is
>                    // immediately followed by the following pattern:
>              '<p' .  // first part of an opening <p> tag
>                  '(?:' . // non-capturing parentheses (same as normal
>                          // parentheses, but a bit faster since we don't
>                          // need to capture what they match for use later)
>                  '>|\s' . // match a closing > or a whitespace character
>                  ')' . // end non-capturing parentheses
>                  '(?!' . // negative lookahead: the match will fail if the
>                          // following pattern matches from the current position
>                  '.*' .  // match until the end of the string
>                  '<p(?:>|\s)' . // same as above - look for another <p> tag
>                  ')' .  // end negative lookahead
>          ')' .      // end positive lookahead
>          '}is';   // ending delimiter, and use modifiers s and i

It was the negative lookahead that confused me, I see. The rest seems
pretty straightforward. Difficult, but straightforward.

>
> About the modifiers: i makes it case-insensitive, and s turns on
> dot-matches-all-mode (including newlines)--otherwise, the . would only match
> until the next newline.

Yes, this I know.
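
For the archives, here is a minimal sketch of what the two modifiers change. The sample HTML string and the [HERE] marker are my own assumptions, not anything from Brian's message. With both modifiers the marker lands before the last paragraph; without them the uppercase <P> is ignored and the dot stops at the newline, so the first <p> is wrongly treated as the last one.

```php
<?php
// Hypothetical sample input: the last paragraph uses an uppercase tag
// and sits on its own line.
$html = "<p>First</p>\n<P class=\"x\">Last</P>";

// Brian's pattern, with and without the i and s modifiers.
$pattern_is   = '{(?=<p(?:>|\s)(?!.*<p(?:>|\s)))}is';
$pattern_none = '{(?=<p(?:>|\s)(?!.*<p(?:>|\s)))}';

// [HERE] is only a visible marker for the insertion point.
$with_is = preg_replace($pattern_is, '[HERE]', $html);
$without = preg_replace($pattern_none, '[HERE]', $html);

echo $with_is, "\n";
echo $without, "\n";
```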

> The regex has two parts: matching a <p> tag, and then making sure there
> aren't any more <p> tags in the string following it. The positive lookahead
> is (hopefully) pretty straightforward. The negative lookahead works by using
> a greedy (regular) .*, which forces the regex engine to match all the way to
> the end of the haystack. Then it encounters the <p(?:>|\s) part, forcing it
> to backtrack until it finds a <p> tag. If it doesn't find one before
> returning to the 'current' position (directly after the <p> tag we just
> matched), then we know we have found the last <p> tag.

Nice. Very nice.
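
Putting the explanation above to work, this is roughly how I would wrap it. It is a sketch only: the function name insert_before_last_paragraph and the sample strings are hypothetical. I used preg_replace_callback instead of preg_replace so that a "$1" or "\1" in the inserted text is not treated as a backreference; the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: insert $insert immediately before the last
// opening <p> tag in $html. If there is no <p> tag, $html comes back
// unchanged.
function insert_before_last_paragraph($html, $insert)
{
    // Zero-width match at a <p> tag that has no later <p> tag after it.
    $regex = '{(?=<p(?:>|\s)(?!.*<p(?:>|\s)))}is';

    // The callback returns the text verbatim, so characters such as
    // "$" in $insert are not interpreted as replacement references.
    return preg_replace_callback(
        $regex,
        function ($match) use ($insert) {
            return $insert;
        },
        $html,
        1
    );
}

$three  = insert_before_last_paragraph('<p>One</p><p>Two</p><p>Three</p>', '<hr>');
$pre    = insert_before_last_paragraph('<p>a</p><pre>x</pre>', '<hr>');
$no_p   = insert_before_last_paragraph('<div>none</div>', '<hr>');
$dollar = insert_before_last_paragraph('<p>a</p>', 'Cost: $1 ');

echo $three, "\n";
```

Note that <pre> is not mistaken for a paragraph, because the pattern requires ">" or whitespace right after "<p".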

> The positive and negative lookahead are 'zero-width' requirements, which
> means they don't advance the regex engine's pointer in the haystack string.
> Since the entire regex is zero-width, the replacement string gets inserted
> at the matched position.

Hmm.
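
To convince myself about the zero-width part, a small sketch (the sample string is my own assumption): preg_match with PREG_OFFSET_CAPTURE reports an empty matched string together with the offset where it matched, and a replacement at that position only adds characters without removing any.

```php
<?php
// Hypothetical sample input with two paragraphs.
$html  = '<p>One</p><p>Two</p>';
$regex = '{(?=<p(?:>|\s)(?!.*<p(?:>|\s)))}is';

// With PREG_OFFSET_CAPTURE, $m[0] is array(matched text, byte offset).
$found        = preg_match($regex, $html, $m, PREG_OFFSET_CAPTURE);
$matched_text = $m[0][0];
$offset       = $m[0][1];

// Nothing is consumed by the match, so the replacement is a pure insertion.
$result = preg_replace($regex, 'X', $html);

var_dump($matched_text, $offset);
echo $result, "\n";
```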

> I hope that made at least a little bit of sense :) If you're doing a lot of
> regex work, I would strongly recommend reading the book Mastering Regular
> Expressions by Jeffrey Friedl... it's very well written and very helpful.

I don't do a lot, but it's a great tool to know when one needs it!
Thank you for the patient explanations.

Just a general note, both these addresses are 404 right now:
http://il.php.net/manual/en/pcre.pattern.modifiers.php
http://uk.php.net/manual/en/pcre.pattern.syntax.php

Dotan Cohen

http://lyricslist.com/
http://what-is-what.com/
