Re: Internationalisation and MB strings

Yeti <yeti@xxxxxxxxxx> · Fri, 1 Aug 2008 17:57:56 +0200

Oh right. Doing 1 measurement only is not even worth a theory.

Well, I'm wondering how much PHP can speed that result up, since we are
calling the same function with the same parameter 10000 times. Wouldn't it
be even more realistic if we called it with changing strings?

<?php

$iterations = 10000;
$mb_array = array();
for ($i = 0; $i < $iterations; ++$i) $mb_array[] = str_shuffle('œŸŒ‡Ņ');
$s_t = microtime(true);
foreach ($mb_array as $mb_string) {
 mb_strlen($mb_string, 'UTF-8');
}
$e_t = microtime(true);
echo '<p>MB_STRLEN took : '.(($e_t - $s_t)*1000/$iterations).' milliseconds</p>';

$s_t = microtime(true);
foreach ($mb_array as $mb_string) {
 strlen($mb_string);
}
$e_t = microtime(true);
echo '<p>STRLEN took : '.(($e_t - $s_t)*1000/$iterations).' milliseconds</p>';

?>

MB_STRLEN took : 0.0525826 milliseconds

STRLEN took : 0.0020655 milliseconds
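
For comparison, here is a rough sketch of the same measurement where both loops read the same varied input and the timer does not depend on microtime(true), which needs PHP 5. The helpers now_float(), avg_ms(), len_bytes() and len_chars() are made up for this example, and the test strings are built with str_repeat() so they stay valid UTF-8. It assumes the file is saved as UTF-8 and mbstring is loaded. The wrapper functions add the same call overhead to both sides, so only the relative difference means anything.

```php
<?php
// Hypothetical helper: current time as a float.
// The string form of microtime() also works on PHP 4.
function now_float() {
    list($usec, $sec) = explode(' ', microtime());
    return (float)$usec + (float)$sec;
}

// Hypothetical helper: average milliseconds per call of $func over $strings.
function avg_ms($func, $strings) {
    $start = now_float();
    foreach ($strings as $s) {
        $func($s);   // variable function call; the result is discarded
    }
    return (now_float() - $start) * 1000 / count($strings);
}

// Thin wrappers so both functions can be called with one argument.
function len_bytes($s) { return strlen($s); }
function len_chars($s) { return mb_strlen($s, 'UTF-8'); }

// Varied input: 10000 valid UTF-8 strings of different lengths.
$strings = array();
for ($i = 0; $i < 10000; ++$i) {
    $strings[] = str_repeat('œŸŒ‡Ņ', 1 + $i % 10);
}

echo 'MB_STRLEN took : ' . avg_ms('len_chars', $strings) . " milliseconds\n";
echo 'STRLEN took : ' . avg_ms('len_bytes', $strings) . " milliseconds\n";
```

The numbers will differ from machine to machine, so it is worth running it a few times before drawing conclusions.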

I could not find out how well str_shuffle supports multibyte strings in
PHP4, so I'm wondering if I did this right.
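
On the str_shuffle question: str_shuffle() works on single bytes, so shuffling a UTF-8 string usually tears multibyte characters apart and leaves invalid sequences. A minimal sketch of a character-wise shuffle follows. The name utf8_shuffle() is made up for this example, and it assumes the file is saved as UTF-8 and that PCRE was built with UTF-8 support (the /u modifier).

```php
<?php
// Hypothetical helper: shuffle a UTF-8 string by character, not by byte.
function utf8_shuffle($str) {
    // With /u, PCRE treats $str as UTF-8, so the empty pattern splits
    // between characters rather than between bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str;   // input was not valid UTF-8; leave it untouched
    }
    shuffle($chars);
    return implode('', $chars);
}

$original = 'œŸŒ‡Ņ';               // 5 characters, 11 bytes in UTF-8
$shuffled = utf8_shuffle($original);

echo strlen($shuffled), "\n";              // 11: byte count is unchanged
echo mb_strlen($shuffled, 'UTF-8'), "\n";  // 5: character count is unchanged
```

Because whole characters are moved around, the result stays valid UTF-8 and both counts match those of the original string.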

On Fri, Aug 1, 2008 at 5:06 PM, Andrew Ballard <aballard@xxxxxxxxx> wrote:

> On Fri, Aug 1, 2008 at 9:50 AM, Yeti <yeti@xxxxxxxxxx> wrote:
> > <?php
> > # Hello Community
> > # Internationalisation, a topic discussed more than enough and YES, I am
> > # looking forward to PHP6.
> > # But in reality I still have to develop for PHP4 and that's where the dog
> > # is buried ^^
> > # We have a customer here who is running a small site, but still in five
> > # different languages.
> > # Lately he started complaining about some strange site behaviours:
> >
> > # He has a discussion board where people can post their ideas, comments etc.
> > # Nothing special.
> > # Every post has a maximum length of 2048 characters, which is checked by
> > # JavaScript in the browser
> > # and after submitting the form by PHP.
> >
> > # Our mistake was to use strlen();
> > global $cc_strlen; global $cc_mb;
> > $cc_strlen = $cc_mb = 0;
> > if (array_key_exists('text', $_POST)) {
> >  $cc_strlen = strlen($_POST['text']);
> >  $cc_mb = mb_strlen($_POST['text'], 'UTF-8'); // new code
> >  if ($cc_strlen > 2048) { /* snip */ } // do something
> > }
> >
> > /* snip */ // do something
> >
> > # this works fine as long as the user only submits single-byte characters,
> > # but with UTF-8 the whole thing changes ..
> > ?>
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> > <html xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> > <title>test</title>
> > </head>
> > <body>
> > <p>You submitted <?php echo $cc_strlen; ?> characters (STRLEN).</p>
> > <p>You submitted <?php echo $cc_mb; ?> characters (MB_STRLEN).</p>
> > <p>Characters Left:<span id="remainder">2048</span></p>
> > <form action="" method="post" onsubmit="return false;" id="post_form">
> > <textarea id="post_text" name="text" onkeydown="check_length();"
> > onchange="check_length();" rows="10" cols="50">œŸŒ‡Ņ</textarea><br />
> > <input type="submit" value="Submit" id="post_button"
> > onclick="submit_form();" />
> > </form>
> > <script type="text/javascript">
> > <!--
> > var the_form = document.getElementById('post_form');
> > var textarea = document.getElementById('post_text');
> > var counter = document.getElementById('remainder');
> > function check_length() {
> >  var remainder = 2048 - textarea.value.length;
> >  var length_alert = false;
> >  if (remainder < 0) {
> >  remainder = 0;
> >  // truncate once to the 2048-character limit
> >  textarea.value = textarea.value.substr(0, 2048);
> >  counter.style.color = 'red';
> >  length_alert = true;
> >  }
> >  if (length_alert) alert('You are already using 2048 characters.');
> >  if (document.all) {
> >  counter.innerText = remainder;
> >  } else {
> >  counter.textContent = remainder;
> >  }
> > }
> > function submit_form() {
> >  check_length();
> >  the_form.submit();
> >  alert ('You submitted ' + textarea.value.length + ' characters');
> >  return true;
> > }
> > -->
> > </script>
> > <?php
> > # Now as soon as one is starting to submit UTF-8 characters strlen is not
> > # working properly any more
> > # So we had to work through thousands of lines of code, replacing strlen()
> > # with mb_strlen();
> > # We also found mb_strlen to take about 8 times longer than strlen().
> >
> > $s_t = microtime();
> > mb_strlen('œŸŒ‡Ņ', 'UTF-8');
> > $e_t = microtime();
> > echo '<p>MB_STRLEN took : '.(($e_t - $s_t)*1000).' milliseconds</p>';
> > $s_t = microtime();
> > strlen('œŸŒ‡Ņ');
> > $e_t = microtime();
> > echo '<p>STRLEN took : '.(($e_t - $s_t)*1000).' milliseconds</p>';
> >
> > # So much for internationalisation.
> > # Just writing this as a reminder for everyone who is facing similar
> > # situations.
> > ?>
> > </body>
> > </html>
> >
>
> You can't determine timing by simply calling each function one time. I
> changed your script to the following:
>
> <?php
>
> $iterations = 10000;
>
> $s_t = microtime(true);
> for ($i = 0; $i < $iterations; ++$i) {
>     mb_strlen('œŸŒ‡Ņ', 'UTF-8');
> }
> $e_t = microtime(true);
> echo '<p>MB_STRLEN took : '.(($e_t - $s_t)*1000/$iterations).' milliseconds</p>';
>
> $s_t = microtime(true);
> for ($i = 0; $i < $iterations; ++$i) {
>     strlen('œŸŒ‡Ņ');
> }
> $e_t = microtime(true);
> echo '<p>STRLEN took : '.(($e_t - $s_t)*1000/$iterations).' milliseconds</p>';
>
> ?>
>
> I ran this script several times, and the results below are fairly typical:
>
> MB_STRLEN took : 0.054733037948608 milliseconds
>
> STRLEN took : 0.037568092346191 milliseconds
>
>
> The multi-byte function is slower, but not even by a factor of 2 on my
> development machine.
>
> Andrew
>
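
Going back to the length check in the original post, this small sketch shows why strlen() rejects posts that are within the limit. within_limit() is a made-up helper name, and the example assumes UTF-8 input and a loaded mbstring extension.

```php
<?php
// Hypothetical helper: true if $text has at most $max characters.
function within_limit($text, $max) {
    return mb_strlen($text, 'UTF-8') <= $max;
}

$max  = 2048;
$text = str_repeat('œ', 2048);   // 2048 characters, 4096 bytes in UTF-8

var_dump(strlen($text) > $max);        // bool(true): the byte count rejects the post
var_dump(within_limit($text, $max));   // bool(true): the character count accepts it
```

The JavaScript counter in the form sees 2048 for this text as well, so only the strlen() check disagrees with what the user was told.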