Hi everyone (and especially Benjamin Kosnik, if you are listening),

I have a stupid C++ question. I've spun my wheels for days on this issue. I've been lost in a morass of std:: ... basic_istream, basic_ostream, basic_istringstream, basic_ostringstream, locale, facet, and codecvt.

A simplified version of my problem: I have these files:

utf8.txt
utf8-w-bom.txt ***
utf16le-w-bom.txt
utf16be-w-bom.txt
utf16le-wo-bom.txt ***
utf16be-wo-bom.txt *
utf32le-w-bom.txt
utf32be-w-bom.txt
utf32le-wo-bom.txt ***
utf32be-wo-bom.txt *

I want to read those files into big behemoth strings: strings that are IN MEMORY utf8, or utf16, or utf32 at my discretion (i.e., programmatically). I want to write out those Unicode strings into new Unicode files.

Now, granted, some of the files are peculiar, as indicated with ***. The ones marked with * are those where the "without BOM" behavior is supposed to presume big-endian, so they are copacetic.

I want to write out those files from those strings, but not necessarily UTF8 to UTF8. I want to go from anything (UTF8, 16, 32) to anything (UTF8, 16, 32).
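To make the BOM cases concrete, here is a rough sketch of how the "with BOM" files could be told apart by sniffing their first bytes. The names Encoding, BomResult, has_prefix, and detect_bom are made up for illustration and are not part of libstdc++. A file without a BOM comes back as Unknown, and the caller then has to apply the big-endian presumption (or some other outside knowledge) itself. One known weakness: a UTF-16LE file whose first character after the BOM is U+0000 looks like UTF-32LE.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, for illustration only.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomResult {
    Encoding encoding;   // Encoding::Unknown when no BOM is present
    std::size_t length;  // number of BOM bytes to skip before the text
};

// True when the first m bytes of data equal sig.
static bool has_prefix(const unsigned char* data, std::size_t n,
                       const unsigned char* sig, std::size_t m) {
    if (n < m) return false;
    for (std::size_t i = 0; i < m; ++i)
        if (data[i] != sig[i]) return false;
    return true;
}

BomResult detect_bom(const unsigned char* data, std::size_t n) {
    assert(data != nullptr || n == 0);

    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
    if (has_prefix(data, n, utf32le, 4)) return {Encoding::Utf32LE, 4};
    if (has_prefix(data, n, utf32be, 4)) return {Encoding::Utf32BE, 4};
    if (has_prefix(data, n, utf8, 3))    return {Encoding::Utf8, 3};
    if (has_prefix(data, n, utf16le, 2)) return {Encoding::Utf16LE, 2};
    if (has_prefix(data, n, utf16be, 2)) return {Encoding::Utf16BE, 2};

    // No BOM: the encoding cannot be determined from a signature alone.
    return {Encoding::Unknown, 0};
}
```

A caller would read the first four bytes of a file, pass them to detect_bom, skip `length` bytes, and pick a decoder based on `encoding`.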
Pseudo code example:

-----------------------------------------
#include <stdint.h> // C99-ism
#include <iostream>
#include <sstream>
#include <string>

class Utf8Char
{
  uint8_t m;
public:
  explicit Utf8Char(char in) : m(in) { }
  operator uint8_t () const { return m; }
};

class Utf16Char
{
  uint16_t m;
public:
  explicit Utf16Char(char in) : m(in) { }
  operator uint16_t () const { return m; }
};

class Utf32Char
{
  uint32_t m;
public:
  explicit Utf32Char(char in) : m(in) { }
  operator uint32_t () const { return m; }
};

typedef std::basic_string<Utf8Char> Utf8String;
typedef std::basic_string<Utf16Char> Utf16String;
typedef std::basic_string<Utf32Char> Utf32String;

typedef std::basic_istream<Utf8Char> istream8;
typedef std::basic_istream<Utf16Char> istream16;
typedef std::basic_istream<Utf32Char> istream32;

typedef std::basic_ostream<Utf8Char> ostream8;
typedef std::basic_ostream<Utf16Char> ostream16;
typedef std::basic_ostream<Utf32Char> ostream32;

typedef std::basic_istringstream<Utf8Char> istringstream8;
typedef std::basic_istringstream<Utf16Char> istringstream16;
typedef std::basic_istringstream<Utf32Char> istringstream32;

typedef std::basic_ostringstream<Utf8Char> ostringstream8;
typedef std::basic_ostringstream<Utf16Char> ostringstream16;
typedef std::basic_ostringstream<Utf32Char> ostringstream32;
-----------------------------------------

BUT... none of that works. At all.

I'm completely dazed and confused. I can't even get this to work (COMPILABLE example with GCC 4.0, so I didn't break my own often-given advice on this forum)...

-----------------------------------------
#include <cstdio> // stderr
#include <ios>
#include <iostream>
#include <sstream>
#include <ext/stdio_filebuf.h>

// Following Stroustrup's 11.7.1 advice...
class Utf16Char
{
public:
  Utf16Char() : c(0) { }
  Utf16Char(unsigned short int in) : c(in) { }
  operator unsigned short int () const { return c; }
private:
  unsigned short int c; // UTF16.
};

typedef std::basic_ostream<Utf16Char> uostream;

int main()
{
  __gnu_cxx::stdio_filebuf<Utf16Char> buf_ucerr(stderr, std::ios_base::out);
  uostream ucerr(&buf_ucerr);
  ucerr.flags(std::ios_base::unitbuf);

  // Commingling cerr and ucerr output isn't going to really work.
  // At this nascent stage, this is just for show-and-tell.
  std::cerr << (ucerr.good() ? "ucerr is good" : "ucerr is not good") << std::endl;
  // Prints: ucerr is good.

  ucerr << Utf16Char(0xFEFF); // BOM, to kick things off.
  // Where are my FF FE hex bytes on output?

  for(int i = 0; i < 1000; ++i)
    ucerr << Utf16Char('x');
  // Where are my 00 78 hex bytes on output?
  // Heck, where is ANY of the output going?
  // Oh, gdb says ucerr is in a bad state.
  // But why?
  // What did I miss?
  // How can I fix it?

  std::cerr << (ucerr.good() ? "ucerr is good" : "ucerr is not good") << std::endl;
  // Prints: ucerr is not good.
}
-----------------------------------------

My immediate goal is to understand how basic_string and basic_istream and basic_ostream can make my life easier.

Then I want to be able to write a little program that does this:

$ unicat --help
unicat [--utfX] [--Xbom] [-o file] [-i | [--] files...]
  --utf8     output utf8 (default)
  --utf16le  output utf16le
  --utf16be  output utf16be
  --utf32le  output utf32le
  --utf32be  output utf32be
  --bom      output bom (even if incorrect)
  --nobom    suppress bom (even if required)
  --autobom  does the right thing (default)
  -o file    output file, otherwise stdout
  -i         input from stdin, not files...
  --         subsequent parms are files...
  files...   any Unicode encoded format

(I already have a little program that does this, but it is written using a little state machine and regular ifstream and ofstream on a byte-by-byte basis. My goal is to understand std::basic_string/stream, not to make this trivial Unicode text concatenation program.)
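For anyone unfamiliar with what the byte-by-byte approach involves, the output half boils down to something like the sketch below. This is illustrative only and is not the actual program; the names Bytes, put_utf8, put_utf16, and put_utf32 are invented here. It takes one decoded code point and appends its bytes in the requested encoding and byte order, using only unsigned char and fixed-width integers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::vector<unsigned char> Bytes;  // hypothetical output buffer type

static unsigned char byte(std::uint32_t v) {
    return static_cast<unsigned char>(v & 0xFF);
}

static bool is_scalar(std::uint32_t cp) {
    // Valid Unicode scalar values exclude the surrogate range.
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Append one code point as 1 to 4 UTF-8 bytes.
void put_utf8(std::uint32_t cp, Bytes& out) {
    assert(is_scalar(cp));
    if (cp < 0x80) {
        out.push_back(byte(cp));
    } else if (cp < 0x800) {
        out.push_back(byte(0xC0 | (cp >> 6)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(byte(0xE0 | (cp >> 12)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(byte(0xF0 | (cp >> 18)));
        out.push_back(byte(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    }
}

// Append one 16-bit code unit in the requested byte order.
static void put_unit16(std::uint32_t unit, bool big_endian, Bytes& out) {
    unsigned char hi = byte(unit >> 8), lo = byte(unit);
    out.push_back(big_endian ? hi : lo);
    out.push_back(big_endian ? lo : hi);
}

// Append one code point as one UTF-16 unit or a surrogate pair.
void put_utf16(std::uint32_t cp, bool big_endian, Bytes& out) {
    assert(is_scalar(cp));
    if (cp < 0x10000) {
        put_unit16(cp, big_endian, out);
    } else {
        std::uint32_t v = cp - 0x10000;
        put_unit16(0xD800 | (v >> 10), big_endian, out);    // high surrogate
        put_unit16(0xDC00 | (v & 0x3FF), big_endian, out);  // low surrogate
    }
}

// Append one code point as four UTF-32 bytes.
void put_utf32(std::uint32_t cp, bool big_endian, Bytes& out) {
    assert(is_scalar(cp));
    for (int i = 0; i < 4; ++i) {
        int shift = big_endian ? 24 - 8 * i : 8 * i;
        out.push_back(byte(cp >> shift));
    }
}
```

Writing a BOM is then just a call with code point 0xFEFF in the chosen encoding, which is how the --bom option could be handled under this scheme.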
Does anyone grok this C++ (and GCC) string and stream magic and the bewildering locale, facet, codecvt -- and how to get it to work with a variety of Unicode encoded inputs, in-memory Unicode encodings, and Unicode encoded outputs?

NOTE: I *must* stay away from char and wchar_t. They are insufficiently portable and reliable for my needs.

HELP!

Insights, understandings, explanations, enlightenments welcome,
--Eljay