Re: preg_replace with UTF-8

Andrew Ballard <aballard@xxxxxxxxx> · Mon, 6 Jul 2009 11:50:57 -0400

On Sun, Jul 5, 2009 at 9:54 PM, SleePy<sleepingkiller@xxxxxxxxx> wrote:
> I seem to be having a minor issue with preg_replace not working as expected
> when using UTF-8 strings. So far I have found out that \w doesn't seem to be
> detecting UTF-8 strings.
>
> This is my test php file:
> <?php
> $data = 'ooooooooooooooooooooooo';
> echo 'Data before: ', $data, '<br />';
>
> $data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
> echo 'Data After: ', $data;
>
> // UTF-8 Test
> $data = 'ффффффффффффффффффффффф';
> echo '<hr />Data before: ', $data, '<br />';
>
> $data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
> echo 'Data After: ', $data;
>
> ?>
>
>
> I would expect it to be:
> Data before: ooooooooooooooooooooooo
> Data After: oooooo < >oooooo < >oooooo < >ooooo
> ---
> Data before: ффффффффффффффффффффффф
> Data After: фффффф < >фффффф < >фффффф < >ффффф
>
> But what I get is:
> Data before: ooooooooooooooooooooooo
> Data After: oooooo < >oooooo < >oooooo < >ooooo
> ---
> Data before: ффффффффффффффффффффффф
> Data After: ффффффффффффффффффффффф
>
> Did I go about this the wrong way or is this a php bug itself?
> I tested this in php 5.3, 5.2.9 and 6.0 (snapshot from a couple weeks ago)
> and received the same results.
>
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

From the manual on PCRE syntax:
"A 'word' character is any letter or digit or the underscore
character, that is, any character which can be part of a Perl 'word'.
The definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place. For
example, in the 'fr' (French) locale, some character codes greater
than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match."

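So whether \w matches a Cyrillic letter depends on PCRE's character
tables, not only on the u modifier. I'm not sure this is exactly what
you want (it might let more things slip past than you intend), but try
adding \pL, the Unicode property that matches any letter, to the
character class: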

<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>
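
If you end up doing this in more than one place, the same idea could be
wrapped in a small function. This is only a rough sketch: the name
break_long_runs() and its parameters are made up for illustration, and
I've swapped \w for an explicit \p{L}\p{N}_ so the result should not
depend on how \w behaves in a given PHP/PCRE build. It assumes PCRE was
compiled with Unicode property support.

```php
<?php
// Hypothetical helper (not part of PHP or of the original post): inserts
// $marker after every run of $length letters, digits, underscores or dots.
function break_long_runs($text, $length = 6, $marker = ' < >')
{
    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '~([\p{L}\p{N}_.]{' . (int) $length . '})~u';

    // ${1} rather than $1, so a marker starting with a digit is not
    // read as part of the group number. A marker containing "$" or "\"
    // would still be interpreted by preg_replace.
    return preg_replace($pattern, '${1}' . $marker, $text);
}

echo break_long_runs(str_repeat('o', 23)), "\n";
echo break_long_runs(str_repeat('ф', 23)), "\n";
```

Both lines should come out broken into groups of six, with the trailing
five characters left alone. If preg_replace returns NULL instead, the
input was probably not valid UTF-8.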

Andrew
