> -----Original Message-----
> From: tlillqvist@xxxxxxxxx [mailto:tlillqvist@xxxxxxxxx] On Behalf Of
> Tor Lillqvist
> Sent: Monday, June 09, 2008 10:24 AM
> To: Boyd, Todd M.
> Cc: gtk-list@xxxxxxxxx
> Subject: Re: GLib: wide-character gregex?
>
> > Is there a regex package in GLib that is capable of
> > searching/matching wide characters?
>
> No. GLib's string APIs (except for the explicit wide char conversion
> ones) handle just plain char strings, generally assumed to be UTF-8 in
> cases where it matters. But if you know that a file is in wide
> characters (i.e. UTF-16LE on Windows), then you can use
> g_utf16_to_utf8() to convert its contents to UTF-8 once you have read
> it in (or mapped it into memory).
>
> > for future reference, I would like to try and track down a wchar_t
> > implementation of regex functions. I was hoping GLib already had
> > them, but perhaps I am wrong.
>
> Wide characters (wchar_t), although per se part of standard C, in
> practice are used mostly in Windows-specific programming. On Unix and
> Linux, especially in free software circles, encoding Unicode as UTF-8
> is the rule, and thus normal string functions and coding conventions
> can be used. (One notable exception is OpenOffice.org, which used
> UTF-16 internally also on Unix. Dunno about Mozilla, for instance.) So
> in software being mainly developed by people using Linux, you seldom
> see wchar_t.
>
> (Note that the wchar_t type in gcc on Linux is 32 bits, not 16 bits
> like on Windows, so it actually can represent all characters in
> current Unicode. On Windows when you use wchar_t strings you still
> have to take into consideration that some characters will actually
> take a pair of wchar_ts, so in practice the kind of code you end up
> writing doesn't differ significantly from code that handles UTF-8 or
> other variable-length encodings anyway. It is a question of handling
> Unicode characters as 1..4 chars or 1..2 wchar_ts.
> You can't just pretend each wchar_t is a freestanding character, and
> that wchar_t strings can be split at any place with each part being
> valid. Surrogate pairs do exist.)

Thank you for your suggestions. As it is now, I've changed my code to
convert to UTF-8 after reading the file, so that its contents can be
regexed properly. I've done away with regexing the file name
altogether, and I am using strstr() to determine if the extension is
one that needs to be opened.

Thanks for all your help! Hopefully, I can get this bugger compiling
(and running!) in Win32 today. Then, I can use GLib's dynamic
structures to store my data instead of the incredibly inefficient
method of double-directory-traversal I'm using now. ;)

Todd Boyd
Web Programmer
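
For anyone reading this later: the "1..4 chars or 1..2 wchar_ts" point quoted above can be illustrated with a small sketch in plain C of what a UTF-16 to UTF-8 conversion has to do with surrogate pairs. The names below (utf16_decode, utf8_encode, utf16_to_utf8, UTF16_CONV_ERROR) are made up for illustration and are not GLib API. The sketch also assumes the 16-bit units are already in host byte order. In a real GLib program, g_utf16_to_utf8() is what does this job.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error value for this sketch (not a GLib constant). */
#define UTF16_CONV_ERROR SIZE_MAX

/* Decode one code point from up to `len` UTF-16 units.
 * Returns the number of units consumed (1 or 2), or 0 if the input is
 * empty or contains an unpaired surrogate. */
size_t utf16_decode(const uint16_t *in, size_t len, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (in[0] < 0xD800 || in[0] > 0xDFFF) {   /* ordinary BMP character */
        *cp = in[0];
        return 1;
    }
    if (in[0] >= 0xDC00)                      /* low surrogate comes first */
        return 0;
    if (len < 2 || in[1] < 0xDC00 || in[1] > 0xDFFF)
        return 0;                             /* high surrogate with no partner */
    *cp = 0x10000 + (((uint32_t)(in[0] - 0xD800) << 10) |
                     (uint32_t)(in[1] - 0xDC00));
    return 2;
}

/* Encode one code point as UTF-8. Writes 1..4 bytes to `out` (which must
 * have room for 4) and returns the number of bytes written. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert `len` UTF-16 units to a NUL-terminated UTF-8 string in `out`,
 * which holds `cap` bytes. Returns the number of bytes written (not
 * counting the NUL), or UTF16_CONV_ERROR on malformed input or if the
 * output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t len,
                     unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < len) {
        uint32_t cp;
        unsigned char tmp[4];
        size_t used = utf16_decode(in + i, len - i, &cp);
        if (used == 0)
            return UTF16_CONV_ERROR;
        size_t n = utf8_encode(cp, tmp);
        if (o + n >= cap)                     /* keep one byte for the NUL */
            return UTF16_CONV_ERROR;
        for (size_t k = 0; k < n; k++)
            out[o++] = tmp[k];
        i += used;
    }
    if (o >= cap)
        return UTF16_CONV_ERROR;
    out[o] = '\0';
    return o;
}
```

The decoder is the only place that needs to know about surrogate pairs. Everything downstream, including the regex matching, can then work on plain char strings in UTF-8. With GLib available, a single g_utf16_to_utf8() call on the file contents replaces all of the above.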