RE: regex question

"Murray @ PlanetThoughtful" <lists@xxxxxxxxxxxxxxxxxxxx> · Tue, 17 May 2005 08:56:51 +1000

> Try (for example if character was "A") ...
> 
> ([^A]|^)A([^A]|$)
> 
> This matches four cases:
> A is at beginning of string and there is another letter after it,
> A has a letter before it and a letter after it,
> A is at end of string and there is a letter before it,
> or A is the only character in the string.

I think this has the same problem that my first attempt at this regex
experienced.

I.e., it will correctly 'find' single instances of 'A', but each match also
includes the characters on either side of the 'found' 'A', because the
negated character classes consume whatever not-A character sits there.

For example, the following:

preg_match_all('/([^A]|^)A([^A]|$)/',
    'A sentence with instAnces of AAA chArActers',
    $thing, PREG_OFFSET_CAPTURE);

print_r($thing);

produces:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => A
                    [1] => 0
                )

            [1] => Array
                (
                    [0] => tAn
                    [1] => 19
                )

            [2] => Array
                (
                    [0] => hAr
                    [1] => 34
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] =>
                    [1] => 0
                )

            [1] => Array
                (
                    [0] => t
                    [1] => 19
                )

            [2] => Array
                (
                    [0] => h
                    [1] => 34
                )

        )

    [2] => Array
        (
            [0] => Array
                (
                    [0] =>
                    [1] => 1
                )

            [1] => Array
                (
                    [0] => n
                    [1] => 21
                )

            [2] => Array
                (
                    [0] => r
                    [1] => 36
                )

        )

)

Note the multiple instances of characters other than 'A' in the array. Also
note that the 4th qualifying 'A' (the second 'A' in 'chArActers') is missed,
because the 'r' before it was already consumed by the match on the
preceding 'A'.
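
The missed 'A' is easier to see on a shorter string. The following is only a
minimal sketch: the test string 'bAbAb' and the variable names are made up
for illustration and do not come from the thread. Two lone 'A's share a
single neighbour, and only the first one is reported.

```php
<?php
// Illustrative string (an assumption, not from the thread): two lone 'A's
// that share the middle 'b' as a neighbour.
$subject = 'bAbAb';

// Consuming pattern: each match eats one character on both sides of the 'A'.
preg_match_all('/([^A]|^)A([^A]|$)/', $subject, $consuming, PREG_OFFSET_CAPTURE);

// After 'bAb' is matched at offset 0, the next attempt starts at offset 3.
// The second 'A' then has neither an unconsumed not-A character nor the
// start of the string in front of it, so it is never matched.
foreach ($consuming[0] as $match) {
    printf("matched '%s' at offset %d\n", $match[0], $match[1]);
}
// prints: matched 'bAb' at offset 0
```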

On the other hand, the following:

preg_match_all('/(?<!A)A(?!A)/',
    'A sentence with instAnces of AAA chArActers',
    $thing, PREG_OFFSET_CAPTURE);

print_r($thing);

produces:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => A
                    [1] => 0
                )

            [1] => Array
                (
                    [0] => A
                    [1] => 20
                )

            [2] => Array
                (
                    [0] => A
                    [1] => 35
                )

            [3] => Array
                (
                    [0] => A
                    [1] => 37
                )

        )

)

Here, only the target characters are matched, without the confusion of extra
unwanted characters. All 4 target 'A's are caught, because the assertions on
either side of the 'A' in the regex pattern are zero-width: they check the
neighbouring characters without consuming them, so those characters remain
available to the next match. So, basically this is employing a negative
look-behind and a negative look-ahead, rather than negated character classes
that consume (and capture) the characters around each 'A'.
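
If the target character is not always 'A', the same look-around pattern can
be built at runtime. This is only a sketch under a few assumptions: the
helper name find_lone_chars() is hypothetical, the character is assumed to
be a single byte, and the code assumes PHP 7 or later for the type
declarations. preg_quote() is used so that characters such as '.' or '*' are
treated literally.

```php
<?php
// Hypothetical helper (not from the thread): returns the byte offsets of
// every occurrence of $char that is not directly next to another $char.
// Assumes $char is a single-byte character.
function find_lone_chars(string $char, string $subject): array
{
    // Escape regex metacharacters; the second argument also escapes the
    // '/' delimiter used below.
    $c = preg_quote($char, '/');

    // Zero-width assertions: not preceded by $char, not followed by $char.
    $pattern = '/(?<!' . $c . ')' . $c . '(?!' . $c . ')/';

    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

    // Each entry is a [text, offset] pair; keep only the offsets.
    return array_column($matches[0], 1);
}

print_r(find_lone_chars('A', 'A sentence with instAnces of AAA chArActers'));
```

With the test sentence from this thread, the offsets returned are 0, 20, 35
and 37, the same four positions shown in the output above.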

I've probably only managed to confuse things more than they were, but I'm
hoping some of what I've said above makes sense (to me, if no-one else).

Regards,

Murray

-- 
PHP General Mailing List (http://www.php.net/)