Re: preg_replace with UTF-8

Thank you Andrew,
That does seem to break up UTF-8 strings, so I will experiment from there.

On Jul 6, 2009, at 8:50 AM, Andrew Ballard wrote:

On Sun, Jul 5, 2009 at 9:54 PM, SleePy<sleepingkiller@xxxxxxxxx> wrote:
I seem to be having a minor issue with preg_replace not working as expected when using UTF-8 strings. So far I have found that \w doesn't seem to match non-ASCII UTF-8 characters.

This is my test php file:
<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>


I would expect it to be:
Data before: ooooooooooooooooooooooo
Data After: oooooo < >oooooo < >oooooo < >ooooo
---
Data before: ффффффффффффффффффффффф
Data After: фффффф < >фффффф < >фффффф < >ффффф

But what I get is:
Data before: ooooooooooooooooooooooo
Data After: oooooo < >oooooo < >oooooo < >ooooo
---
Data before: ффффффффффффффффффффффф
Data After: ффффффффффффффффффффффф

Did I go about this the wrong way, or is this a PHP bug?
I tested this in PHP 5.3, 5.2.9 and 6.0 (a snapshot from a couple of weeks ago)
and received the same results.


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



From the manual on PCRE syntax:
"A 'word' character is any letter or digit or the underscore
character, that is, any character which can be part of a Perl 'word'.
The definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place. For
example, in the 'fr' (French) locale, some character codes greater
than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match."

I'm not sure if this is exactly what you want (or if it might let more
things slip past than you intend), but try this:

<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>
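
If combining \w with \pL turns out to let too much or too little through, one alternative is to spell out the Unicode property classes explicitly. The following is only a sketch: it assumes PCRE was built with Unicode property support and that the input is valid UTF-8, and the variable names ($data, $after, $expected) are just for illustration.

```php
<?php
// 23 Cyrillic letters, the same length as the UTF-8 test string above.
$data = str_repeat('ф', 23);

// \pL matches any Unicode letter and \pN any Unicode number (a wider set
// than the ASCII digits 0-9). The underscore and dot are listed literally.
// The u modifier makes PCRE treat pattern and subject as UTF-8.
$after = preg_replace('~([\pL\pN_.]{6})~u', '$1 < >', $data);

// preg_replace returns NULL on failure, for example on invalid UTF-8 input.
if ($after === null) {
    echo 'preg error code: ', preg_last_error();
} else {
    echo 'Data After: ', $after;
}
?>
```

With 23 letters this should insert the marker after each of the first three groups of six and leave the last five letters untouched.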

Andrew

