Frank Sweetser wrote:
> Unless, of course, you're at a good sized school with lots of
> international students, and have fileservers holding filenames created
> on desktops running in Chinese, Turkish, Russian, and other locales.

What I struggle with here is why they're not using ru_RU.UTF-8, zh_CN.UTF-8, etc. as their locales. Why mix charsets?

I don't think these people should be forced to use a UTF-8 database and encoding conversion if they want to mix and match charsets for file-name chaos on their machines, though. I'd just like to be able to back up systems that _do_ have consistent charsets in ways that let me later reliably search for files by name, restore to any host, and so on.

Perhaps I'm strange in thinking that all this mix-and-match encoding stuff is bizarre and backward. The Mac OS X and Windows folks seem to agree, though: let the file system store Unicode data, and translate at the file system or libc layer for applications that insist on using other encodings.

I do take Greg Stark's point (a), though. As *nix systems stand, solutions will only ever be mostly-works, not always-works, which I agree isn't good enough. Since there's no sane agreement about encodings on *nix systems, and everything is just byte strings that different apps can interpret in different ways under different environmental conditions, we may as well throw up our hands in disgust and give up trying to do anything sensible.

The alternative is saying that files the file system considers legal can't be backed up because of their names, which I agree isn't OK. The system shouldn't permit those files to exist either, but I suspect we'll have this borked encoding-agnostic wackiness for as long as we have *nix systems at all, since nobody will ever agree on anything for long enough to change it. Sigh. This is about the only time I've ever wished I was using Windows (or Mac OS X).

Also: Greg, your point (c) goes two ways.
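For anyone following along, here's a minimal Python 3 sketch of the two failure modes in play: the same on-disk filename bytes "mean" different things under different encodings, and even within UTF-8, canonically equivalent names need not be byte-equal, so byte-wise searches (e.g. against an SQL_ASCII database) can miss files. The byte values and names are purely illustrative, not from any real backup.

```python
import unicodedata

# 1. The same byte string decodes to different text under different encodings.
raw = b'\xc3\xa9'            # filename bytes as they might sit on disk
print(raw.decode('utf-8'))   # one character: 'é'
print(raw.decode('latin-1')) # two characters: 'Ã©'

# 2. Canonically equivalent Unicode strings are not byte-equal, so a
# byte-wise filename comparison can fail even with consistent UTF-8.
nfc = unicodedata.normalize('NFC', 'caf\u00e9')  # 'é' as one code point
nfd = unicodedata.normalize('NFD', 'caf\u00e9')  # 'e' plus combining acute
print(nfc == nfd)            # False: same text, different byte sequences
print(unicodedata.normalize('NFC', nfd) == nfc)  # True once both normalized
```

A backup tool that recorded the source encoding and normalized filenames before comparison would sidestep both problems, at least for hosts with consistent charsets.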
If I can't trust my backup software to restore my filenames from one host exactly correctly to another host that may have configuration differences not reflected in the backup metadata, a different OS revision, and so on, then what good is it for disaster recovery? How do I even know what those byte strings *mean*? Bacula doesn't record the default system encoding with backup jobs, so there's no way for even the end user to try to fix up the file names for a different encoding; you're faced with some byte strings in wtf-is-this-anyway encoding and guesswork. Even recording LC_CTYPE in the backup job metadata and offering the _option_ to convert encodings on restore would be a big step (though it wouldn't fix searches by filename failing to match because of encoding mismatches).

Personally, I'm just going to stick to a UTF-8-only policy for all my hosts and work around the limitation that way. It has worked OK so far, though I don't much like that different normalizations of Unicode won't compare equal under SQL_ASCII, so I can't reliably search for file names.

--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general