"Achilles Maroulis" <achmar@ath.forthnet.gr>
> > Is there a way to convert a string to unicode utf-16?
>
> Apparently not built into php, which sort of surprises me..
>
> You could make one if you want, it's not hard.
> ...
> If you REALLY REALLY REALLY want a conversion function, I might be able
> to build one for you next week. But I don't think you want one. Try this
> instead:
>
> header( 'Content-Type: text/html; charset=UTF-8' );

Okay, I got this far two weeks ago, and haven't been able to break out any
more time since. (Take the kids to see the dolphins and all, you know.)

So I'm posting what I have, which is incomplete and entirely untested. It
should serve as a starting point for anyone who really wants to be able to
convert between forms.

Note the comment at the end -- there is no way to tell how far the parse
went. The only way to catch an error is to check that the returned value is
not in the range (128 .. 255), which is an entirely unacceptable method of
checking errors, since the code points in that range are valid characters in
their own right. Using reference parameters to return the moved buffer
pointer and an error flag would be trivial, so that would be the next step
after reading through the code and before testing.

And, of course, if you need utf-8 to utf-16, you'll need another routine to
fold values out of the basic plane to surrogate pairs, a process that is
also described somewhere on unicode.org's pages.

-------------------------------------------------------------------
/* Parses one Unicode character out of a nominally utf-8 stream
// and returns the utf-32 (i. e., 32 bit integer) encoding.
// On error, returns the next byte from the stream.
// Joel Rees, Amagasaki, Japan, July 2003, released to public domain.
// Full of bugs! Fix before using! Use at your own risk!
// (Neither I nor the company I work for assume any responsibility for this code.)
*/
function parseOneUTF8( $buffer, $position )
{
    $point = 0;
    $count = 0;
    $error = 0;
    $lead = 0;  /* Stays 0 if $position is already past the end. */
    $buflen = strlen( $buffer );
    if ( $position < $buflen )
    {
        /* The lead octet gives the sequence length and the top bits. */
        $lead = ord( substr( $buffer, $position++, 1 ) );
        if ( $lead < 0x80 )
        {   $count = 1; $point = $lead; }
        elseif ( $lead >= 0xC0 && $lead <= 0xDF )
        {   $count = 2; $point = ( $lead & 0x1F ); }
        elseif ( $lead >= 0xE0 && $lead <= 0xEF )
        {   $count = 3; $point = ( $lead & 0x0F ); }
        elseif ( $lead >= 0xF0 && $lead <= 0xF7 )
        {   $count = 4; $point = ( $lead & 0x07 ); }
        else
        {   $error = 1; }
        /* 0x80 ~ 0xBF are continuation octets, and
        // 0xF8 ~ 0xFF not lead octets in Unicode UTF-8. */

        /* Fold in the continuation octets, six bits each. */
        for ( $i = 0; $i < $count - 1 && $error == 0; ++$i )
        {
            if ( $position < $buflen )
            {
                $next = ord( substr( $buffer, $position++, 1 ) );
                if ( $next >= 0x80 && $next <= 0xBF )
                {   $point <<= 6; $point |= ( $next & 0x3F ); }
                else
                {   $error = 1; }
            }
            else
            {   $error = 1; }
        }

        /* Reject overlong forms, and anything beyond U+10FFFF. */
        switch ( $count )
        {
        case 2:
            if ( $point < 0x80 ) { $error = 1; }
            break;
        case 3:
            if ( $point < 0x800 ) { $error = 1; }
            break;
        case 4:
            if ( $point < 0x10000 || $point > 0x10FFFF ) { $error = 1; }
            break;
        }
    }
    /* Surrogates are not valid in UTF-8. */
    if ( $point >= 0xd800 && $point <= 0xdfff )
    {   $error = 1; }
    /* On any error, fall back to returning the lead octet by itself. */
    if ( $error != 0 )
    {   $count = 1; $point = $lead; }
    /* Also need to return $count or the parse point,
    // and $error, of course, through reference parameters. */
    return $point;
}
-------------------------------------------------------------------

-- 
Joel Rees, programmer, Systems Group
Altech Corporation (Alpsgiken), Osaka, Japan
http://www.alpsgiken.co.jp

-- 
PHP Windows Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
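
For the utf-8 to utf-16 case mentioned above, the folding of values outside the basic plane into surrogate pairs might look roughly like the following. This is only a sketch: the function name foldToUTF16 is made up for illustration and is not part of the routine posted above, and it assumes the value passed in has already been checked to be a valid code point (no surrogates, nothing above 0x10FFFF). It returns the 16 bit code units as an array of integers and leaves byte order to the caller.

```php
<?php
/* Hypothetical helper, not part of the original routine:
// folds one code point into one or two 16 bit UTF-16 code units.
// Assumes $point is already a valid code point. */
function foldToUTF16( $point )
{
    if ( $point < 0x10000 )
    {   return array( $point ); }   /* Basic plane: one unit, unchanged. */
    $point -= 0x10000;                      /* Leaves a 20 bit value. */
    $high = 0xD800 | ( $point >> 10 );      /* Top 10 bits. */
    $low = 0xDC00 | ( $point & 0x3FF );     /* Bottom 10 bits. */
    return array( $high, $low );
}
```

Each value returned from the parse would be passed through this, and the resulting units written out as two bytes apiece in whichever byte order the consumer expects.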