Re: Catastrophic changes to PostgreSQL 8.4

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Thu, 03 Dec 2009 10:54:07 +0800

On 2/12/2009 9:18 PM, Kern Sibbald wrote:
> Hello,
>
> I am the project manager of Bacula.  One of the database backends that Bacula
> uses is PostgreSQL.

As a Bacula user (though I'm not on the Bacula lists), first - thanks 
for all your work. It's practically eliminated all human intervention 
from something that used to be a major pain. Configuring it to handle 
the different backup frequencies, retention periods and diff/inc/full 
needs of the different data sets was a nightmare, but once set up it's 
been bliss. The 3.x `Accurate' mode is particularly nice.

> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it.  Certain
> operating systems such as Unix, Linux and MacOS can have filenames that are
> not in UTF-8 format.  Since Bacula stores filenames in PostgreSQL tables, we
> use SQL_ASCII.

I noticed that while doing some work on the Bacula database a while ago.

I was puzzled at the time about why Bacula does not translate file names 
from the source system's encoding to utf-8 for storage in the database, 
so all file names are known to be sane and are in a known encoding.

Because Bacula does not store the encoding or seem to transcode the file 
name to a single known encoding, it does not seem to be possible to 
retrieve files by name if the Bacula console is run on a machine with a 
different text encoding to the machine the files came from. After all, 
café in utf-8 is a different byte sequence to café in iso-8859-1, and 
won't match in equality tests under SQL_ASCII.
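
To make the mismatch concrete, here is a minimal Python sketch (my own
illustration, not Bacula code) that encodes the same name both ways and
compares the raw bytes, which is effectively all an SQL_ASCII database
can do:

```python
# Illustration only: the same file name as stored by two different clients.
name = "caf\u00e9"  # café

utf8_bytes = name.encode("utf-8")         # bytes from a utf-8 client
latin1_bytes = name.encode("iso-8859-1")  # bytes from an iso-8859-1 client

# A byte-wise equality test sees two different names.
names_match = utf8_bytes == latin1_bytes

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'
print(names_match)   # False
```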

Additionally, I'm worried that restoring to a different machine with a 
different encoding may fail and, if it doesn't, will result in hopelessly 
mangled file names. This wouldn't be fun to deal with during disaster 
recovery. (I don't yet know if there are provisions within Bacula itself 
to deal with this; I need to do some testing.)

Anyway, it'd be nice if Bacula would convert file names to utf-8 at the 
file daemon, using the encoding of the client, for storage in a utf-8 
database.

Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as 
Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use 
other encodings. If a unix system does use another encoding, this may be 
determined from the locale in the environment and used to convert file 
names to utf-8.
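
As a rough sketch of that conversion step, assuming the client's locale
really does describe the encoding of its file names (the function names
below are hypothetical, not anything in the Bacula file daemon):

```python
import locale


def client_encoding() -> str:
    """Best guess at the client's file-name encoding, taken from the locale."""
    return locale.getpreferredencoding(False)


def filename_to_utf8(raw_name: bytes, source_encoding: str) -> bytes:
    """Transcode a raw file name from the client's encoding to utf-8.

    Raises UnicodeDecodeError if raw_name is not valid in source_encoding.
    """
    return raw_name.decode(source_encoding).encode("utf-8")


# A name from an iso-8859-1 client becomes the utf-8 byte sequence.
converted = filename_to_utf8(b"caf\xe9", "iso-8859-1")
```

In real use the second argument would come from client_encoding() on
the machine being backed up; it is passed explicitly here so the
example does not depend on the locale of whoever runs it.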

Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will 
have file names in the locale's encoding, like UNIX systems, and are 
fairly easily handled.

About the only issue I see is that systems may have file names that are 
not valid text strings in the current locale, usually due to buggy 
software butchering text encodings. I guess a *nix system _might_ have 
different users running with different locales and encodings, too. The 
latter case doesn't seem easy to handle cleanly as file names on unix 
systems don't have any indication of what encoding they're in stored 
with them. I'm not really sure these cases actually show up in practice, 
though.

Personally, I'd like to see Bacula capable of using a utf-8 database, 
with proper encoding conversion at the fd for non-utf-8 encoded client 
systems. It'd really simplify managing backups for systems with a 
variety of different encodings.

( BTW, one way to handle incorrectly encoded filenames and paths might 
be to have a `bytea' field that's generally null to store such mangled 
file names. Personally though I'd favour just rejecting them. )
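
A minimal sketch of that fallback, assuming a hypothetical schema with
a text column for the name and a nullable `bytea' column for names that
won't decode (split_filename is my own illustrative helper):

```python
from typing import Optional, Tuple


def split_filename(
    raw_name: bytes, source_encoding: str
) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (text_name, raw_fallback); exactly one of the two is None.

    text_name would go in the normal text column, raw_fallback in the
    (assumed) bytea column reserved for mangled names.
    """
    try:
        return raw_name.decode(source_encoding), None
    except UnicodeDecodeError:
        # Not valid in the claimed encoding: keep the bytes untouched.
        return None, raw_name


good = split_filename(b"caf\xc3\xa9", "utf-8")  # decodes cleanly
bad = split_filename(b"caf\xe9", "utf-8")       # invalid utf-8, kept as bytes
```

Rejecting such names instead would just mean letting the
UnicodeDecodeError propagate rather than catching it.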

> We set SQL_ASCII by default when creating the database via the command
> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>
> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the default
> table copied is not template0.

It's a pity that specifying an encoding other than the template's own, 
when creating from a template other than template0, doesn't cause the 
CREATE DATABASE command to fail with an error.
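
For what it's worth, CREATE DATABASE accepts a TEMPLATE clause, and
naming template0 explicitly is the usual way to ask for an encoding
that differs from the default template's. A hedged sketch of how a
setup script might build such a statement (create_database_sql is a
hypothetical helper, not part of Bacula or PostgreSQL):

```python
def create_database_sql(name: str, encoding: str) -> str:
    """Build a CREATE DATABASE statement that copies template0 explicitly."""
    # Crude validation so the sketch never interpolates arbitrary text.
    if not name.isidentifier():
        raise ValueError(f"unsafe database name: {name!r}")
    if not encoding.replace("_", "").isalnum():
        raise ValueError(f"unsafe encoding name: {encoding!r}")
    return f"CREATE DATABASE {name} ENCODING '{encoding}' TEMPLATE template0;"


statement = create_database_sql("bacula", "SQL_ASCII")
print(statement)
```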

--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)