On 1/6/07, Jorge Godoy <jgodoy@xxxxxxxxx> wrote:
> Andrew Chernow <pg-job@xxxxxxxxx> writes:
>
> >> I mean, how do you handle integrity with data
> >> outside the database?
>
> > You don't, the file system handles integrity of the stored data.  Although,
> > one must be careful to avoid db and fs orphans, meaning a record with no
> > corresponding file or a file with no corresponding record.  Always
> > write()/insert an image file to the system within a transaction, including
> > writing the image out to the fs.  Make sure to unlink any partially written
> > image files.
>
> And how do you guarantee that after a failure?
>
> You're restoring two different sets of data here:
>
> - backup from your database
> - backup from your files
you have a point -- keeping two sets of data in sync is more difficult than working off a single interface (you have to implement your own referential integrity of sorts), but that has to be balanced against the fact that postgresql is not great (yet) at storing and retrieving huge numbers of large binary objects. then again, neither are most filesystems, in my opinion.

also, and this is coming from a sql junkie, sql is not really that good at handling binary data; you have to go right to binary prepared statements before things become even remotely reasonable.

taking your side for a moment, the backup argument (against databases) is a little disingenuous: while dumping a 10 terabyte database is a pain, backing up a 10 terabyte filesystem is no picnic either -- rsync will get seizures. you have to implement some smart logic that takes advantage of the fact that the information rarely, if ever, changes. andrew's posts hint at a way to do that which could be implemented directly in a database or a filesystem...one would choose a filesystem for this because the overhead is lower, but there are other ways...
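to make the binary point concrete, here's a rough libpq sketch of what "binary prepared statements" buy you: the image bytes travel as a binary bytea parameter instead of being escaped into a text literal. the images(name text, data bytea) table and the helper name are placeholders I'm assuming for illustration, not anything from this thread.

/* sketch: insert raw image bytes as a binary bytea parameter
 * (assumes a table: images(name text, data bytea)) */
#include <stdio.h>
#include <libpq-fe.h>

int insert_image(PGconn *conn, const char *name, const char *buf, size_t len)
{
    const char *values[2] = { name, buf };
    int lengths[2] = { 0, (int) len };   /* lengths only consulted for binary params */
    int formats[2] = { 0, 1 };           /* 0 = text, 1 = binary */

    PGresult *res = PQexecParams(conn,
        "INSERT INTO images (name, data) VALUES ($1, $2)",
        2, NULL, values, lengths, formats, 0);

    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

the text-format route means escaping and unescaping every byte, which is where a lot of the client-side overhead on large blobs comes from.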
> How do you link them together on that specific operation?  Or even on a
> daily basis, if you get corrupted data...
if you take this approach, you have to layer middleware over the filesystem and route every access through it. it's a programming challenge, but it can be done...
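for what it's worth, a bare-bones cut of that middleware (basically the write()/insert pattern andrew describes above) could look something like the sketch below. the images(path) table, the target path and the error handling are simplified assumptions, not a finished design.

/* sketch: store the bytes on the filesystem and the reference in the
 * database in one transaction, unlinking the file if anything fails so
 * neither side is left with an orphan */
#include <fcntl.h>
#include <unistd.h>
#include <libpq-fe.h>

int store_image(PGconn *conn, const char *path, const char *buf, size_t len)
{
    /* write the file first; a partial file with no row is easy to clean up */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return 0;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
        close(fd);
        unlink(path);                 /* remove the partially written file */
        return 0;
    }
    close(fd);

    /* then record the reference; roll back and unlink if the insert fails */
    const char *values[1] = { path };
    PQclear(PQexec(conn, "BEGIN"));
    PGresult *res = PQexecParams(conn, "INSERT INTO images (path) VALUES ($1)",
                                 1, NULL, values, NULL, NULL, 0);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        PQclear(res);
        PQclear(PQexec(conn, "ROLLBACK"));
        unlink(path);                 /* keep db and fs consistent */
        return 0;
    }
    PQclear(res);
    PQclear(PQexec(conn, "COMMIT"));
    return 1;
}

the hole that remains is a crash between the file write and the commit, which can still leave a file with no row -- hence the need for some kind of periodic orphan sweep that reconciles the directory against the table.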
> >>> How do you plan your backup routine
>
> > In regards to backup, back up the files one-by-one.  Grab the latest image
> > file refs from the database and start backing up those images.  Each
> > successfully backed up image should be followed by inserting that file's
> > database record into a remote db server.  If anything fails, clean up the
> > partial image file (to avoid orphaned data) and roll back the transaction.
> >
> > just one idea.  i'm sure there are other ways of doing it.  point is, this
> > is completely possible to do reliably.
>
> Wouldn't replication with, e.g., Slony be easier?  And wouldn't letting the
> database handle all the integrity be easier?  I mean, create an "images"
> table and then make your record depend on this table, so if there's no
> record with the image, you won't have any references to it left.
I think if you were to replicate a really big database for something like this, a log-based replication approach (PITR, or a modified flavor of it) would be a better bet. pg_dump should not even enter into your vocabulary, unless you did some hackery like storing the data in progressive tables so that only the recently written ones ever need dumping.
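for reference, basic WAL archiving for PITR is mostly one postgresql.conf setting plus a periodic base backup; the archive directory below is a placeholder, and the exact knobs vary a bit by version:

# postgresql.conf: copy each completed WAL segment to an archive area
archive_command = 'cp %p /mnt/wal_archive/%f'

pair that with an occasional filesystem-level base backup (roughly, a copy of the data directory bracketed by pg_start_backup() and pg_stop_backup()), and a restore is "copy the base backup back and replay the archived wal". the cost scales with how much the data changes rather than with the total size, which is exactly the property you want when most of the images never change.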
> It would also make the backup plan easier: backup the database.
backing up big *anythings* can be tough; pg_dump is not a scalable tool.
> Not counting that, depending on your choice of filesystem and image size,
> you might get very poor performance.
performance of storing large blobs in the database is not great...if you can get 50MB/sec writing to a filesystem, count yourself lucky if you get 10MB/sec writing to the database, and that's only if you make your client code very efficient.

I think at some point in the future postgresql might be a useful tool in the way you are describing. currently there are some performance issues, more flexibility in backup options would be required, and maybe some other things.

merlin