Re: AFAIK, Microsoft C runtime library does not support UTF-8, Actually, here is a clip from the runtime library source code: tmode = _textmode(fh); switch(tmode) { case __IOINFO_TM_UTF8 : /* For a UTF-8 file, we need 2 buffers, because after reading we need to convert it into UNICODE - MultiByteToWideChar doesn't do in-place conversions. */ /* MultiByte To WideChar conversion may double the size of the buffer required & hence we divide cnt by 2 */ /* * Since we are reading UTF8 stream, cnt bytes read may vary * from cnt wchar_t characters to cnt/4 wchar_t characters. For * this reason if we need to read cnt characters, we will * allocate MBCS buffer of cnt. In case cnt is 0, we will * have 4 as minimum value. This will make sure we don't * overflow for reading from pipe case. * * * In this case the numbers of wchar_t characters that we can * read is cnt/2. This means that the buffer size that we will * require is cnt/2. */ /* For UTF8 we want the count to be an even number */ This is in the _read(fd, buffer, count) function, and shows that it will in fact read UTF-8 and automatically transform it to UTF-16LE transparently. The documentation for _open explains this feature. Meanwhile, a quick look at _mbslen() etc. shows that they are implemented, and will handle UTF-8 encoded text as variable-length char* just fine as long as suitable tables are loaded in its locale. An internal header shows macros for generating the lead-byte information as needed by that table. Now, the default when a program starts is to use the "C" locale. The locale argument to setlocale can take a form ".code_page", so calling setlocale (LC_CTYPE, ".65001"); should do the trick. Assuming, that is, that you don't hit macros that assume that characters are never multibyte. So define the preprocessor symbol _MBCS when you compile. Older versions might not work right because MBCS (multibyte character strings) was only actually implemented to DBCS (double-byte). That is, a single lead byte would be followed by a second byte, and no other cases are provided for. But, GB18030 has up to 4 bytes in a single character. It might still not be completely "clean" though because GB18030 has a "double double" nature to it. Just like assuming 16-bit characters period mostly works with surrogate pairs even if you didn't code full UTF-16 support, DBCS code will see a 4-byte GB18030 character as two double byte characters. So it gets the len (in characters) wrong, and might still break up what is supposed to be a single character. So it really needs some improvement from the historical DBCS-only code to work properly. Anyway, if UTF-8 really doesn't work with MBCS functions acceptably well, and the goal is to allow passage of all characters through the program, then set the program to use Chinese. GB18030 is =fully= supported and is just another (albeit strange) encoding for Unicode. As for what fprintf (stderr, "unable to open %s", path); will do, it will have no problem copying the contents of path to the output stream no matter how it is encoded. The result will be sent to stderr, which may be autotranslating the local code page to UTF-16 or UTF-8, but by default just feeds the stream of bytes to the console window's 8-bit API, which has its own code page setting. Personally, I have printf'ed UTF-8 encoded text to standard output. It looks OK if the console is also set to UTF-8. --John (please excuse the footer; it's not my idea) TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html