Hello, Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email. The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename. So, rather than trying to figure everything out (impossible, I think) and rather than failing to backup files, Bacula gets the "raw" filename from the OS and stores it on the Volume then puts it in the database. We treat the filename as if it is UTF-8 for display purposes, but in all other cases, what we want is for the filename to go into the database and come back out unchanged. On MySQL we use BLOBS. On PostgreSQL, we TEXT and set the encoding to SQL_ASCII so that PostgreSQL will not attempt to do any translation. This works well, and I hope that PostgreSQL will continue to support letting Bacula insert text characters in the database with no character encoding checks in the future. See more notes below ... On Thursday 03 December 2009 03:54:07 Craig Ringer wrote: > On 2/12/2009 9:18 PM, Kern Sibbald wrote: > > Hello, > > > > I am the project manager of Bacula. One of the database backends that > > Bacula uses is PostgreSQL. > > As a Bacula user (though I'm not on the Bacula lists), first - thanks > for all your work. It's practically eliminated all human intervention > from something that used to be a major pain. Configuring it to handle > the different backup frequencies, retention periods and diff/inc/full > needs of the different data sets was a nightmare, but once set up it's > been bliss. The 3.x `Accurate' mode is particularly nice. > > > Bacula sets the database encoding to SQL_ASCII, because although > > Bacula "supports" UTF-8 character encoding, it cannot enforce it. > > Certain operating systems such as Unix, Linux and MacOS can have > > filenames that are not in UTF-8 format. Since Bacula stores filenames in > > PostgreSQL tables, we use SQL_ASCII. > > I noticed that while doing some work on the Bacula database a while ago. > > I was puzzled at the time about why Bacula does not translate file names > from the source system's encoding to utf-8 for storage in the database, > so all file names are known to be sane and are in a known encoding. We don't and cannot know the encoding scheme on Unix/Linux systems (see above), so attempting to convert them to UTF-8 would just consume more CPU time and result in errors at some point. > > Because Bacula does not store the encoding or seem to transcode the file > name to a single known encoding, it does not seem to be possible to > retrieve files by name if the bacula console is run on a machine with a > different text encoding to the machine the files came from. After all, > café in utf-8 is a different byte sequence to café in iso-9660-1, and > won't match in equality tests under SQL_ASCII. If all the filenames go in in "binary" or "litteral" form, then any tests will work fine. The only test Bacula does is "equality". Bacula doesn't worry about sorting. Users may care, but for backup and restore the only test Bacula needs is equality, and as long as you are working with unchanged byte streams everything works on every system. The one place where we do convert filenames is on Windows. We convert UCS to UTF-8. > > Additionally, I'm worried that restoring to a different machine with a > different encoding may fail, and if it doesn't will result in hopelessly > mangled file names. This wouldn't be fun to deal with during disaster > recovery. (I don't yet know if there are provisions within Bacula its > self to deal with this and need to do some testing). Yes, if you restore on a different system with a different encoding, you will end up with the same binary string at the OS level, but when you see the filenames they may look different. > > Anyway, it'd be nice if Bacula would convert file names to utf-8 at the > file daemon, using the encoding of the client, for storage in a utf-8 > database. As I mention, this is not possible since Unix/Linux stores "binary" strings. They can be in any format. > > Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as > Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use > other encodings. If a unix system does use another encoding, this may be > determined from the locale in the environment and used to convert file > names to utf-8. As I mentioned above, on Linux/Unix systems, the user is free to change the encoding at will, and when he does so, existing filenames remain unchanged, so it is not possible to choose a particular encoding and backup and restore files without changing the filenames. Bacula backs them up and restores them using "binary" strings so we don't have problems of changing encoding. > > Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will > have file names in the locale's encoding, like UNIX systems, and are > fairly easily handled. > > About the only issue I see is that systems may have file names that are > not valid text strings in the current locale, usually due to buggy > software butchering text encodings. I guess a *nix system _might_ have > different users running with different locales and encodings, too. The > latter case doesn't seem easy to handle cleanly as file names on unix > systems don't have any indication of what encoding they're in stored > with them. I'm not really sure these cases actually show up in practice, > though. > > Personally, I'd like to see Bacula capable of using a utf-8 database, > with proper encoding conversion at the fd for non-utf-8 encoded client > systems. It'd really simplify managing backups for systems with a > variety of different encodings. Maybe this will happen someday, but first we will need Linux/Unix systems that *force* the encoding to be in some particular standard. > > ( BTW, one way to handle incorrectly encoded filenames and paths might > be to have a `bytea' field that's generally null to store such mangled > file names. Personally though I'd favour just rejecting them. ) > > > We set SQL_ASCII by default when creating the database via the command > > recommended in recent versions of PostgreSQL (e.g. 8.1), with: > > > > CREATE DATABASE bacula ENCODING 'SQL_ASCII'; > > > > However, with PostgreSQL 8.4, the above command is ignored because the > > default table copied is not template0. > > It's a pity that attempting to specify an encoding other than the safe > one when using a non-template0 database doesn't cause the CREATE > DATABASE command to fail with an error. I didn't actually run it myself, so it is possible that it produced an error message, but it did apparently create the database but with UTF-8 encoding. Most of these things are done in script files, so certain non-fatal errors may be overlooked. As far as I can tell, it took the above encoding command, and perhaps printed an error message but went ahead and created the database with an encoding that was not correct. If that is indeed the case, then it is in my opinion, a bad design policy. I would much prefer that either Postgres accept the command or that it not create the database. This way, either the database would work as the user expects or there would be no database, and the problem would be resolved before it creates databases that cannot be read. In any case we have corrected the command to include the TEMPLATE, but this won't help people with older Bacula's. The other point I wanted to emphasize is that the documentation implied that future versions of Postgres may eliminate the feature of having SQL_ASCII (i.e. the ability to input arbritrary binary strings). As I said, that would be a pity -- I suppose we could switch to using LOs or whatever they are called in Postgres, but that would be rather inconvenient. Thanks for all the responses, Best regards, Kern -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general