Non UTF-8 charset fallback support in GLib (Was Re: plans for long term support releases?)

Daniel Yek <dyek@xxxxxxxx> · Wed, 17 Jan 2007 22:18:31 -0800

At 09:31 PM 1/17/2007, Bruno Wolff III wrote:
On Wed, Jan 17, 2007 at 23:10:14 +0100,
  Ola Thoresen <redhat@xxxxxxxx> wrote:
>
> One of the worst examples of this is the change to UTF-8 as default
> charset.  I am a devoted UTF-8 user myself, but it is probably the
> single change that has caused most pain for others, and it is still
> causing trouble.

> When we changed to UTF-8 as default, there was no
> easy way to convert filesystems, documents, text-files, webpages...

Not sure if these two utilities could help:
(1) iconv -f old-encoding -t UTF-8 filename > newfilename

(2) utf8ize

The script:
http://ftp.penguin.cz/pub/users/utx/misc/utf8ize.gopts

The web page (search for utf8ize):
http://www.penguin.cz/~utx/
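
For reference, the iconv call in (1) amounts to decoding the bytes with the old charset and re-encoding them as UTF-8. Below is a minimal Python sketch of that same step. The function name `convert_to_utf8` is made up for illustration, and the sketch assumes the file fits in memory and really is in the encoding you name.

```python
from pathlib import Path


def convert_to_utf8(src, dst, old_encoding):
    """Re-encode the file at src from old_encoding to UTF-8 and write it to dst."""
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if the bytes are not valid in old_encoding,
    # and LookupError if Python does not know the encoding name.
    text = raw.decode(old_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


# Hypothetical usage, mirroring the iconv command line:
#   convert_to_utf8("filename", "newfilename", "iso-8859-1")
```

Writing to a separate output file, as the iconv command does, keeps the original around in case the guessed source encoding turns out to be wrong.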

> The first thing almost everyone I know who installs Fedora,
> Redhat or Suse does is to change /etc/sysconfig/i18n to go back to
> en_US as default LANG. Simply because it takes a h... of a lot of work
> to convert all your files and applications and there are no good tools
> out there to help you.

UTF-8 is an encoding and en_US is a locale. You are comparing different
types of things. Perhaps you meant that UTF-8 was being used instead of
ASCII or Latin 1? Note that ASCII is in a sense a subset of UTF-8, so
converting from ASCII to UTF-8 isn't a big deal.

One thing I feel GLib falls short on is API support for non-UTF-8 content.
For example, if a text file is opened with GIOChannel, reads fail unless
the file contains only valid UTF-8.

The fallback could be more graceful. For example, the API could accept a
fallback charset and use it to convert bytes that are not legal UTF-8 into
UTF-8. There should be enough API that is as tolerant of non-UTF-8 content
as possible, for instance through such a fallback charset.
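
To make the idea concrete, here is a rough sketch of a per-byte fallback, written in Python rather than against GLib. Nothing in it is GLib API: the handler name `utf8-fallback`, the helper `decode_with_fallback`, and the choice of ISO-8859-1 as the fallback are all assumptions made for this example.

```python
import codecs

# Assumption: one legacy charset covers the user's pre-UTF-8 files.
FALLBACK_CHARSET = "iso-8859-1"


def _fallback_handler(exc):
    """Decode bytes that are not legal UTF-8 with the fallback charset."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the offset where UTF-8 decoding resumes.
    return bad.decode(FALLBACK_CHARSET), exc.end


codecs.register_error("utf8-fallback", _fallback_handler)


def decode_with_fallback(data):
    """Decode data as UTF-8, routing illegal byte sequences through the fallback."""
    return data.decode("utf-8", errors="utf8-fallback")
```

With this, input that mixes valid UTF-8 with stray ISO-8859-1 bytes decodes instead of failing. One known limitation of the sketch: legacy bytes that happen to form a legal UTF-8 sequence are taken as UTF-8, so the result is a best effort and not a guaranteed-correct conversion.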

For example, many people used a single European charset before UTF-8
became mainstream. With just one fallback charset specified, all of them
would be covered: their existing files could be opened, and new files
would be saved as UTF-8.

As it is now, if you want your application to read both UTF-8 and
ISO-8859-1 (just the two most common encodings, no more), most facilities
in GLib are not an option. If a text file contains a single copyright
symbol encoded in ISO-8859-1, you fail to read the entire file, which is
very far from ideal.
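
The failure mode is easy to reproduce with any strict UTF-8 decoder. The sketch below uses Python's decoder as a stand-in, not GLib itself, and the helper name `try_strict_utf8` and the sample contents are invented for the example.

```python
def try_strict_utf8(data):
    """Return (text, None) for valid UTF-8, else (None, offset of the first bad byte)."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc.start


# Otherwise clean content with one ISO-8859-1 copyright sign (byte 0xA9).
contents = b"Copyright \xa9 2007\n" + b"plain ASCII line\n" * 1000

text, bad_offset = try_strict_utf8(contents)
# text is None: a single bad byte makes the whole decode fail,
# even though every other byte in the input is fine.
```

Here `bad_offset` points at the one offending byte, which is exactly the information a fallback charset would need to recover the character instead of rejecting the file.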

What do people think?

--
Daniel Yek

--
fedora-devel-list mailing list
fedora-devel-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/fedora-devel-list