Re: Unicode or not?

Andrej Gelenberg <andrej.gelenberg@xxxxxxx> · Mon, 05 Mar 2012 21:52:45 +0100

Hi,

No, you can't simply cast it to wchar_t*. I recommend reading this article
about Unicode under Linux:
http://www.ibm.com/developerworks/linux/library/l-linuni/index.html

There are two ways to deal with UTF-8. The first is to keep it as char*
and use it as a plain C string. Pro: it's simple, you can keep using the
standard str* functions, and it is often smaller than a wchar_t string.
Con: non-ASCII characters take more than one byte, so strlen() reports
the number of bytes rather than the number of characters, which can lead
to problems when displaying or counting characters. You can still count
them with mblen(), but it's a bit of a pain.
The second option is to convert it to wchar_t with the mbstowcs()
function. Pro: every character has the same fixed width. Con: you need to
convert between UTF-8 and wchar_t, and you need an additional buffer to
hold the wchar_t string (you can't do it in place, because the wchar_t
string will often be bigger than the UTF-8 string).
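
To show what "a bit of a pain" means for the first option, here is a
minimal sketch of counting characters with mblen(). The name
mb_char_count is made up for this example. It assumes the program has
already called setlocale(LC_ALL, "") and that the user's locale is UTF-8;
without that, mblen() does not know how to decode multibyte input.

```c
#include <locale.h>   /* setlocale(), to be called by the program at startup */
#include <stdlib.h>   /* mblen() */
#include <string.h>   /* strlen() */

/* Hypothetical helper: number of characters (not bytes) in the multibyte
 * string s, decoded according to the current LC_CTYPE locale.
 * Returns (size_t)-1 if an invalid multibyte sequence is found. */
size_t mb_char_count(const char *s)
{
    size_t left = strlen(s);   /* bytes still to be examined */
    size_t count = 0;

    mblen(NULL, 0);            /* reset mblen()'s internal shift state */
    while (left > 0) {
        int n = mblen(s, left);   /* byte length of the next character */
        if (n <= 0)
            return (size_t)-1;    /* invalid or incomplete sequence */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```

For a pure ASCII argument this gives the same number as strlen(). For an
argument with accented or non-Latin characters it should come out smaller
than strlen(), provided the locale really is UTF-8.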

For example, if you need or just want a wchar_t string, you can do
something like this:

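// assumes setlocale(LC_ALL, "") was called first, so that mbstowcs()
// decodes according to the user's (UTF-8) locale
size_t l = strlen(argv[i]);
wchar_t *nbuf = calloc(l + 1, sizeof(*nbuf)); // +1 for the terminating L'\0'
if ( !nbuf ) return 1;
l = mbstowcs(nbuf, argv[i], l + 1); // mbstowcs may return a smaller value
                                    // than strlen did
if ( l == (size_t)-1 ) {
  /* invalid multibyte sequence was encountered */
  free(nbuf);
  return 2;
}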
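
If you would rather not allocate for the worst case, a variation is to
call mbstowcs() twice: once with a NULL destination to get the length,
and once to convert. This is only a sketch. The name mb_to_wide is made
up for this example, and it relies on the POSIX behaviour that
mbstowcs(NULL, s, 0) returns the number of wide characters needed. As
above, it assumes setlocale(LC_ALL, "") has been called in a UTF-8
locale.

```c
#include <locale.h>   /* setlocale(), to be called by the program at startup */
#include <stdlib.h>   /* mbstowcs(), malloc(), free() */
#include <wchar.h>    /* wchar_t */

/* Hypothetical helper: convert the multibyte string s into a newly
 * allocated wide string. Returns NULL on an invalid multibyte sequence
 * or on allocation failure. The caller must free() the result. */
wchar_t *mb_to_wide(const char *s)
{
    /* First pass: count the wide characters, excluding the terminator. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;               /* invalid multibyte sequence */

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (!w)
        return NULL;

    /* Second pass: convert, with room for the terminating L'\0'. */
    mbstowcs(w, s, n + 1);
    return w;
}
```

In main() you would call it as mb_to_wide(argv[i]) and check the result
for NULL before using it.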

Regards,
Andrej Gelenberg

On 03/05/2012 09:19 PM, Krzysztof wrote:
> So how to read effectively UTF-8 characters from char* passed as an
> argument under Linux?
> Should one simply cast argv[n] to wchar_t*?
> 
