Unicode and C++ (GCC 4.0)

Eljay Love-Jensen <eljay@xxxxxxxxx> · Thu, 04 Aug 2005 10:00:16 -0500

Hi everyone (and especially Benjamin Kosnik if you are listening),

I have a stupid C++ question.  I've spun my wheels for days on this issue.

I've been lost in a morass of std:: ... basic_istream, basic_ostream,
basic_istringstream, basic_ostringstream, locale, facet, and codecvt.

A simplified version of my problem:

I have these files:

utf8.txt
utf8-w-bom.txt ***
utf16le-w-bom.txt
utf16be-w-bom.txt
utf16le-wo-bom.txt ***
utf16be-wo-bom.txt *
utf32le-w-bom.txt
utf32be-w-bom.txt
utf32le-wo-bom.txt ***
utf32be-wo-bom.txt *

I want to read those files into big behemoth strings: strings that are,
IN MEMORY, UTF-8, UTF-16, or UTF-32 at my discretion (i.e., chosen
programmatically).  I want to write out those Unicode strings into new
Unicode files.

Now, granted, some of them are peculiar, as indicated with ***.  The ones
marked with a single * rely on the rule that a file without a BOM is
presumed big-endian, so they are copacetic.
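
To make the BOM cases above concrete, here is a minimal sketch of how the
leading bytes could be classified.  The names (Encoding, BomResult,
detectBom) are hypothetical and not from any library; the caller supplies
the fallback for the "without BOM" files.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical names for illustration only.
enum Encoding { ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE };

struct BomResult
{
  Encoding encoding;  // detected encoding, or the caller's fallback
  size_t   bomLength; // number of BOM bytes to skip; 0 if no BOM was found
};

// Classify the first n bytes at p.  If no BOM is present, `fallback` is
// returned unchanged (e.g. ENC_UTF16BE for a file presumed big-endian).
BomResult detectBom(const unsigned char* p, size_t n, Encoding fallback)
{
  BomResult r;
  r.encoding = fallback;
  r.bomLength = 0;

  // UTF-32 is tested before UTF-16 because FF FE 00 00 starts with FF FE.
  // Caveat: this misreads UTF-16LE text whose first character is U+0000.
  if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
  { r.encoding = ENC_UTF32LE; r.bomLength = 4; }
  else if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
  { r.encoding = ENC_UTF32BE; r.bomLength = 4; }
  else if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
  { r.encoding = ENC_UTF8; r.bomLength = 3; }
  else if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
  { r.encoding = ENC_UTF16LE; r.bomLength = 2; }
  else if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
  { r.encoding = ENC_UTF16BE; r.bomLength = 2; }

  return r;
}
```

The *** files are exactly the ones where this kind of check cannot help:
the encoding has to come from the caller, not from the bytes.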

I want to write out those files from those strings, but not necessarily UTF8
to UTF8. I want to go from anything (UTF8, 16, 32) to anything (UTF8, 16,
32).
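
One way to think about "anything to anything" is to go through Unicode
code points: decode the source into 32-bit values, then encode each one in
the target form.  The encoding half might look like the sketch below.  The
function names are hypothetical, and the code assumes each input is a valid
Unicode scalar value (no surrogates, nothing above U+10FFFF); it does no
error checking.

```cpp
#include <stdint.h>
#include <string>
#include <vector>

// Append one code point to `out` as 1 to 4 UTF-8 bytes.
// Assumes cp is a valid Unicode scalar value.
void appendUtf8(std::string& out, uint32_t cp)
{
  if (cp < 0x80)
  {
    out += static_cast<char>(cp);
  }
  else if (cp < 0x800)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Append one code point to `out` as 1 or 2 UTF-16 code units.
// Code points above U+FFFF become a surrogate pair.
void appendUtf16(std::vector<uint16_t>& out, uint32_t cp)
{
  if (cp < 0x10000)
  {
    out.push_back(static_cast<uint16_t>(cp));
  }
  else
  {
    cp -= 0x10000;
    out.push_back(static_cast<uint16_t>(0xD800 | (cp >> 10)));
    out.push_back(static_cast<uint16_t>(0xDC00 | (cp & 0x3FF)));
  }
}
```

This works on plain integers and containers; it says nothing about how to
get the same effect out of basic_string, basic_ostream, and codecvt, which
is the part I am stuck on.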

Pseudo code example:

-----------------------------------------
#include <stdint.h> // C99-ism
#include <iostream>
#include <sstream>
#include <string>

class Utf8Char
{
  uint8_t m;
public:
  explicit Utf8Char(char in) : m(in) { }
  operator uint8_t () const { return m; }
};

class Utf16Char
{
  uint16_t m;
public:
  explicit Utf16Char(char in) : m(in) { }
  operator uint16_t () const { return m; }
};

class Utf32Char
{
  uint32_t m;
public:
  explicit Utf32Char(char in) : m(in) { }
  operator uint32_t () const { return m; }
};

typedef std::basic_string<Utf8Char> Utf8String;
typedef std::basic_string<Utf16Char> Utf16String;
typedef std::basic_string<Utf32Char> Utf32String;

typedef std::basic_istream<Utf8Char> istream8;
typedef std::basic_istream<Utf16Char> istream16;
typedef std::basic_istream<Utf32Char> istream32;

typedef std::basic_ostream<Utf8Char> ostream8;
typedef std::basic_ostream<Utf16Char> ostream16;
typedef std::basic_ostream<Utf32Char> ostream32;

typedef std::basic_istringstream<Utf8Char> istringstream8;
typedef std::basic_istringstream<Utf16Char> istringstream16;
typedef std::basic_istringstream<Utf32Char> istringstream32;

typedef std::basic_ostringstream<Utf8Char> ostringstream8;
typedef std::basic_ostringstream<Utf16Char> ostringstream16;
typedef std::basic_ostringstream<Utf32Char> ostringstream32;
-----------------------------------------

BUT... none of that works.  At all.

I'm completely dazed and confused.

I can't even get this to work (COMPILABLE example with GCC 4.0, so I didn't
break my own often-given advice on this forum)...

-----------------------------------------
#include <ios>
#include <iostream>
#include <sstream>
#include <ext/stdio_filebuf.h>

// Following Stroustrup's 11.7.1 advice...
class Utf16Char
{
public:
  Utf16Char() : c(0) { }
  Utf16Char(unsigned short int in) : c(in) { }
  operator unsigned short int () const { return c; }
private:
  unsigned short int c; // UTF16.
};

typedef std::basic_ostream<Utf16Char> uostream;

int main()
{
  __gnu_cxx::stdio_filebuf<Utf16Char> buf_ucerr(stderr, std::ios_base::out);
  uostream ucerr(&buf_ucerr);
  ucerr.flags(std::ios_base::unitbuf);

  // Commingling cerr and ucerr output isn't going to really work.
  // At this nascent stage, this is just for show-and-tell.
  std::cerr
    << (ucerr.good() ? "ucerr is good" : "ucerr is not good")
    << std::endl;
  // Prints: ucerr is good.

  ucerr << Utf16Char(0xFEFF); // BOM, to kick things off.
  // Where are my FF FE hex bytes on output?

  for(int i = 0; i < 1000; ++i)
    ucerr << Utf16Char('x');
  // Where are my 78 00 hex bytes on output?
  // Heck, where is ANY of the output going?
  // Oh, gdb says ucerr is in a bad state.
  // But why?
  // What did I miss?
  // How can I fix it?

  std::cerr
    << (ucerr.good() ? "ucerr is good" : "ucerr is not good")
    << std::endl;
  // Prints: ucerr is not good.
}
-----------------------------------------

My immediate goal is to understand how basic_string and basic_istream and
basic_ostream can make my life easier.

Then I want to be able to write a little program that does this:

$ unicat --help
unicat [--utfX] [--Xbom] [-o file] [-i | [--] files...]
 --utf8     output utf8 (default)
 --utf16le  output utf16le
 --utf16be  output utf16be
 --utf32le  output utf32le
 --utf32be  output utf32be
 --bom      output bom (even if incorrect)
 --nobom    suppress bom (even if required)
 --autobom  does the right thing (default)
 -o file    output file, otherwise stdout
 -i         input from stdin, not files...
 --         subsequent parms are files...
 files...   any Unicode encoded format

(I already have a little program that does this, but it is written using a
little state machine and regular ifstream and ofstream on a byte-by-byte
basis.  My goal is to understand std::basic_string/stream, not to make this
trivial Unicode text concatenation program.)
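
For reference, the byte-by-byte style of output looks roughly like the
sketch below for UTF-16.  This is not the actual program, just a minimal
illustration with hypothetical names (put16, writeUtf16, utf16Bytes); it
spells out the byte order by hand instead of relying on any facet.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write one UTF-16 code unit as two bytes in the requested byte order.
void put16(std::ostream& os, uint16_t unit, bool bigEndian)
{
  const char hi = static_cast<char>(unit >> 8);
  const char lo = static_cast<char>(unit & 0xFF);
  if (bigEndian) { os.put(hi); os.put(lo); }
  else           { os.put(lo); os.put(hi); }
}

// Write a sequence of UTF-16 code units, optionally preceded by a BOM.
// For a real file, the stream should be opened with std::ios_base::binary.
void writeUtf16(std::ostream& os, const std::vector<uint16_t>& units,
                bool bigEndian, bool withBom)
{
  if (withBom)
    put16(os, 0xFEFF, bigEndian);
  for (size_t i = 0; i < units.size(); ++i)
    put16(os, units[i], bigEndian);
}

// Convenience wrapper that returns the encoded bytes as a std::string.
std::string utf16Bytes(const std::vector<uint16_t>& units,
                       bool bigEndian, bool withBom)
{
  std::ostringstream os;
  writeUtf16(os, units, bigEndian, withBom);
  return os.str();
}
```

With a single code unit 0x0078 and a BOM, little-endian output is the four
bytes FF FE 78 00 and big-endian output is FE FF 00 78.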

Does anyone grok this C++ (and GCC) string and stream magic and the
bewildering locale, facet, codecvt -- and how to get it to work with a
variety of Unicode-encoded inputs, in-memory Unicode encodings, and
Unicode-encoded outputs?

NOTE: I *must* stay away from char and wchar_t.  They are insufficiently
portable and reliable for my needs.

HELP!  Insights, understandings, explanations, enlightenments welcome,
--Eljay