On Sun, Jul 5, 2009 at 9:54 PM, SleePy<sleepingkiller@xxxxxxxxx> wrote:
> I seem to be having a minor issue with preg_replace not working as expected
> when using UTF-8 strings. So far I have found out that \w doesn't seem to be
> detecting UTF-8 strings.
>
> This is my test php file:
> <?php
> $data = 'ooooooooooooooooooooooo';
> echo 'Data before: ', $data, '<br />';
>
> $data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
> echo 'Data After: ', $data;
>
> // UTF-8 Test
> $data = 'ффффффффффффффффффффффф';
> echo '<hr />Data before: ', $data, '<br />';
>
> $data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
> echo 'Data After: ', $data;
>
> ?>
>
>
> I would expect it to be:
> Data before: ooooooooooooooooooooooo
> Data After: oooooo < >oooooo < >oooooo < >ooooo
> ---
> Data before: ффффффффффффффффффффффф
> Data After: фффффф < >фффффф < >фффффф < >ффффф
>
> But what I get is:
> Data before: ooooooooooooooooooooooo
> Data After: oooooo < >oooooo < >oooooo < >ooooo
> ---
> Data before: ффффффффффффффффффффффф
> Data After: ффффффффффффффффффффффф
>
> Did I go about this the wrong way or is this a php bug itself?
> I tested this in php 5.3, 5.2.9 and 6.0 (snapshot from a couple weeks ago)
> and received the same results.
>
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

From the manual on PCRE syntax:

"A 'word' character is any letter or digit or the underscore character, that is, any character which can be part of a Perl 'word'. The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the 'fr' (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type.
If the current matching point is at the end of the subject string, all of them fail, since there is no character to match."

I'm not sure if this is exactly what you want (or if it might let more things slip past than you intend), but try adding \pL, which matches any Unicode letter when the u modifier is set, to the character class:

<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>

Andrew
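
A minimal sketch of the same idea wrapped in a reusable function, in case it helps. The function name break_long_runs and its parameters are hypothetical (they are not from the thread), and the type hints assume PHP 7 or later. It spells the class out with Unicode properties (\pL for letters, \pN for numbers) instead of relying on \w, and uses a callback so that the separator is inserted literally:

```php
<?php
// Hypothetical helper, not from the thread: append $sep after every run of
// $len consecutive letters, digits, underscores or dots.
function break_long_runs(string $text, int $len = 6, string $sep = ' '): string
{
    // \pL = any Unicode letter, \pN = any Unicode number.
    // The u modifier makes PCRE treat the pattern and subject as UTF-8.
    $pattern = '~[\pL\pN_.]{' . $len . '}~u';

    // A callback avoids having "$" or "\" in $sep interpreted as a
    // backreference in the replacement string.
    $result = preg_replace_callback($pattern, function (array $m) use ($sep) {
        return $m[0] . $sep;
    }, $text);

    // preg_replace_callback returns null on error (for example, when the
    // subject is not valid UTF-8); fall back to the unchanged input.
    return $result === null ? $text : $result;
}

echo break_long_runs(str_repeat('o', 23)), "\n";
echo break_long_runs(str_repeat('ф', 23)), "\n";
```

With 23-character inputs like the ones in the test file, both calls should split the string into three groups of six followed by the remaining five characters. Whether digits and the underscore belong in the class depends on what you want to let through, so adjust it to taste.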