Dave,

Thanks for the info. Did you find this on the Web? I searched for a long
time and couldn't find it, and what information I had was incorrect. If it
is on the Web, what is the URL?

John

On Sat, 17 Apr 2004, Dave Mielke wrote:

> [quoted lines by John J. Boyer on 2004/04/17 at 09:13 -0500]
>
> >For one of my projects I need to convert UTF-8 to UTF-32. However, I
> >can't find information on which bits are set in the various bytes of a
> >multi-byte UTF-8 character.
>
> 0X00 through 0X7F are literal, i.e. single-byte characters.
>
> If bit 7 is set and bit 6 is clear, i.e. the range 0X80 through 0XBF, it's a
> continuation byte containing six more bits. The first byte of a multi-byte
> character is never within this range.
>
> If bits 7 and 6 are set but bit 5 isn't, i.e. the range 0XC0 through 0XDF, then
> it's the first 5 bits of a two-byte character. The resultant value is an
> 11-bit character in the range 0 through 0X7FF.
>
> Each time the first clear bit is moved one position to the right the length of
> the multi-byte character increases by one byte and the number of leading bits
> in the first byte decreases by 1. Every non-leading byte, as mentioned above,
> has bit 7 set and bit 6 clear, i.e. is within the range 0X80 through 0XBF, and
> appends six bits to the value. Here's a table to illustrate:
>
> First  RangeOf    NumOf  Init  Totl  MaxUnicode
> 0-Bit  FirstByte  Bytes  Bits  Bits  Character
>   7    0X00 0X7F    1      7     7   0X0000007F
>   5    0XC0 0XDF    2      5    11   0X000007FF
>   4    0XE0 0XEF    3      4    16   0X0000FFFF
>   3    0XF0 0XF7    4      3    21   0X001FFFFF
>   2    0XF8 0XFB    5      2    26   0X03FFFFFF
>   1    0XFC 0XFD    6      1    31   0X7FFFFFFF
>

-- 
John J. Boyer; Executive Director, Chief Software Developer
Computers to Help People, Inc.
http://www.chpi.org
825 East Johnson; Madison, WI 53703

_______________________________________________
Blinux-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/blinux-list
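
For reference, the byte layout in the quoted table could be turned into a UTF-8 to UTF-32 decoder roughly as follows. This is only a minimal sketch in C, not code from either poster: the function name utf8_decode and its return convention are hypothetical. It follows the 1 to 6 byte scheme exactly as the table gives it, and it does not reject overlong encodings or surrogate values. Note that the current UTF-8 definition (RFC 3629) stops at 4 bytes and U+10FFFF, so a strict converter would add those checks.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one character from the UTF-8 bytes at s (n bytes available)
 * into *out. Returns the number of bytes consumed, or 0 if the
 * sequence is malformed or truncated.
 * Hypothetical helper for illustration; follows the quoted table.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0) return 0;

    unsigned char first = s[0];
    size_t len;
    uint32_t value;

    /* The position of the first clear bit selects the length and
       how many low bits of the first byte belong to the value. */
    if (first < 0x80)      { len = 1; value = first; }
    else if (first < 0xC0) { return 0; }  /* continuation byte cannot lead */
    else if (first < 0xE0) { len = 2; value = first & 0x1F; }
    else if (first < 0xF0) { len = 3; value = first & 0x0F; }
    else if (first < 0xF8) { len = 4; value = first & 0x07; }
    else if (first < 0xFC) { len = 5; value = first & 0x03; }
    else if (first < 0xFE) { len = 6; value = first & 0x01; }
    else                   { return 0; }  /* 0xFE and 0xFF are not in the table */

    if (len > n) return 0;                /* truncated input */

    /* Every following byte must be in 0x80..0xBF and appends six bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = value;
    return len;
}
```

Calling this in a loop and advancing the input pointer by the returned length converts a whole buffer; each decoded value is already the UTF-32 code unit, so it can be stored directly in a uint32_t array.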