Re: [PHP-WIN] ascii to utf-16

"Achilles Maroulis" <achmar@ath.forthnet.gr>

> > Is there a way to convert a string to unicode utf-16?
> 
> Apparently not built into php, which sort of surprises me..
> 
> You could make one if you want, it's not hard. 
> ...
> If you REALLY REALLY REALLY want a conversion function, I might be able
> to build one for you next week. But I don't think you want one. Try this
> instead:
> 
>     header( 'Content-Type: text/html; charset=UTF-8' );

Okay, I got this far two weeks ago and haven't been able to free up any
more time since. (Take the kids to see the dolphins and all, you know.)
So I'm posting what I have, which is incomplete and entirely untested.
It should serve as a starting point for anyone who really wants to be
able to convert between forms.

Note the comment at the end -- the caller has no way to tell how far the
parse went. The only way to catch an error is to check whether the
returned value falls in the range (128 .. 255), which is an entirely
unacceptable method of checking errors, since U+0080 through U+00FF are
also perfectly valid results. Using reference parameters to return the
moved buffer position and an error flag would be trivial, so that would
be the next step after reading through the code and before testing.

And, of course, if you need utf-8 to utf-16, you'll need another routine
to fold values outside the basic plane into surrogate pairs, a process
that is also described somewhere on unicode.org's pages.

-------------------------------------------------------------------

/* Parses one Unicode character out of a nominally utf-8 stream
// and returns the utf-32 (i. e., 32 bit integer) encoding.
// On error, returns the next byte from the stream.
// Joel Rees, Amagasaki, Japan, July 2003, released to public domain.
// Full of bugs! Fix before using! Use at your own risk!
// (Neither I nor the company I work for assume any responsibility for this code.)
*/
function parseOneUTF8( $buffer, $position )
{	$point = 0;
	$count = 0;
	$error = 0;
	$buflen = strlen( $buffer );
	if ( $position < $buflen )
	{	$lead = ord( substr( $buffer, $position++, 1 ) );
		if ( $lead < 0x80 )
		{	$count = 1;
			$point = $lead;
		}
		elseif ( $lead >= 0xC0 && $lead <= 0xDF )
		{	$count = 2;
			$point = ( $lead & 0x1F );
		}
		elseif ( $lead >= 0xE0 && $lead <= 0xEF )
		{	$count = 3;
			$point = ( $lead & 0x0F );
		}
		elseif ( $lead >= 0xF0 && $lead <= 0xF7 )
		{	$count = 4;
			$point = ( $lead & 0x07 );
		}
		else
		{	$error = 1;
		}
		/* 0xF8 ~ 0xFF not lead octets in Unicode UTF-8. */
		for ( $i = 0; $i < $count - 1; ++$i )
		{
			if ( $position < $buflen )
			{	$next = ord( substr( $buffer, $position++, 1 ) );
				if ( $next >= 0x80 && $next <= 0xBF )
				{	$point <<= 6;
					$point |= ( $next & 0x3F );
				}
				else 
				{	$error = 1;
				}
			}
			else
			{	$error = 1;
			}
		}
		switch ( $count )
		{
		case 2:
			if ( $point < 0x80 )
			{	$error = 1;
			}
			break;
		case 3:
			if ( $point < 0x800 )
			{	$error = 1;
			}
			break;
		case 4:
			if ( $point < 0x10000 )
			{	$error = 1;
			}
			break;
		}
	}
	if ( ( $point >= 0xD800 && $point <= 0xDFFF ) || $point > 0x10FFFF )
	{	$error = 1;	/* surrogates and values past U+10FFFF are not valid */
	}
	if ( $error != 0 )	/* covers invalid lead octets ($count == 0) too */
	{	$count = 1;
		$point = $lead;
	}
	/* Also need to return $count or the parse point,
	// and $error, of course, through reference parameters. 
	*/
	return $point;	
}

-------------------------------------------------------------------
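
The closing comment in the function above asks for the parse point and
the error flag to come back through reference parameters. A minimal
sketch of what that might look like follows. The name parseOneUTF8Ref,
the $min bookkeeping, and the choice to resynchronize one octet past a
bad lead are all my own assumptions, so treat it as a starting point
rather than a finished routine.

```php
<?php
// Sketch only: hypothetical variant of parseOneUTF8 that moves
// $position past what it parsed and sets $error (0 or 1) by reference.
function parseOneUTF8Ref( $buffer, &$position, &$error )
{	$error = 0;
	$buflen = strlen( $buffer );
	if ( $position >= $buflen )
	{	$error = 1;	// nothing left to parse
		return 0;
	}
	$start = $position;
	$lead = ord( $buffer[ $position++ ] );
	if ( $lead < 0x80 )
	{	return $lead;	// plain ASCII, one octet
	}
	elseif ( $lead >= 0xC0 && $lead <= 0xDF )
	{	$count = 2; $point = $lead & 0x1F; $min = 0x80;
	}
	elseif ( $lead >= 0xE0 && $lead <= 0xEF )
	{	$count = 3; $point = $lead & 0x0F; $min = 0x800;
	}
	elseif ( $lead >= 0xF0 && $lead <= 0xF7 )
	{	$count = 4; $point = $lead & 0x07; $min = 0x10000;
	}
	else
	{	$error = 1;	// stray continuation octet or invalid lead
		return $lead;
	}
	for ( $i = 1; $i < $count; ++$i )
	{	$next = ( $position < $buflen ) ? ord( $buffer[ $position ] ) : 0;
		if ( $next < 0x80 || $next > 0xBF )
		{	$error = 1;	// missing or malformed continuation octet
			break;
		}
		$point = ( $point << 6 ) | ( $next & 0x3F );
		++$position;
	}
	// Reject overlong forms, surrogates, and values past U+10FFFF.
	if ( $error == 0 && ( $point < $min || $point > 0x10FFFF
		|| ( $point >= 0xD800 && $point <= 0xDFFF ) ) )
	{	$error = 1;
	}
	if ( $error != 0 )
	{	$position = $start + 1;	// step one octet past the bad lead
		return $lead;
	}
	return $point;
}

// Usage: walk a whole buffer, collecting code points.
$text = "A\xC3\xA9";
$pos = 0;
$err = 0;
$points = array();
while ( $pos < strlen( $text ) )
{	$points[] = parseOneUTF8Ref( $text, $pos, $err );
}
```

With the error flag separate from the return value, a returned 0xE9 with
$err == 0 is a real U+00E9, and the same value with $err == 1 is a bad
octet, which removes the ambiguity in the original.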



-- 
Joel Rees, programmer, Systems Group
Altech Corporation (Alpsgiken), Osaka, Japan
http://www.alpsgiken.co.jp