On Fri, 2009-06-12 at 09:07 -0700, Alan Hodgson wrote:
> On Friday 12 June 2009, Scott Ribe <scott_ribe@xxxxxxxxxxxxxxx> wrote:
> > > It's far easier to backup and restore a database than millions of
> > > small files. Small files = random disk I/O.

That depends on how you're backing up. If you want to back a file system
up database style, use the filesystem dump utility.

fs dump tools have gone somewhat out of fashion lately because of space-use
concerns, the inconvenience of partial restores, cross-hardware/version
compatibility issues, etc., but they're actually rather similar to what you
get when backing up a DB like Pg with a fs-level snapshot. If your dump
tool supports incrementals, you also get results rather akin to PITR WAL
logging.

Personally, there's no way I'd back up a filesystem with dump utilities. I
don't trust even dumpe2fs enough, the space requirements are prohibitive,
and restores are nightmarish. I have similar problems (minus the trust
issue) with backups of databases full of LOBs, though.

> > On the contrary, I think backup is one of the primary reasons to move
> > files *out* of the database. Decent incremental backup software greatly
> > reduces the I/O & time needed for backup of files as compared to a pg
> > dump. (Of course this assumes the managed files are long-lived.)
>
> We'll have to just disagree on that. You still have to do level 0 backups
> occasionally. Scanning a directory tree of millions of files to decide
> what to backup for an incremental can take forever.

In particular, if your incremental backup tool is smart enough to track
deleted files, the resource requirements can be astonishing. In addition to
looking for new/altered files, the tool needs to test whether any
previously backed-up file has since vanished - and it can't really even
rely on directory modification times to exclude very static data from
checking. (A rough sketch of what such a scan has to do is appended below.)

I use Bacula at work, and backing up my Cyrus mail spools is a very
I/O-intensive and somewhat CPU-intensive operation even for incrementals,
since the backup server and file daemon are exchanging file lists all the
time, scanning the whole huge directory tree, etc.

The advantage, though, is that the resulting daily backups are only about
1GB instead of 60-70GB. When you have to keep three full monthly backups
plus daily incrementals for audit/history purposes, that matters. I can't
see any way I could reasonably achieve a similar effect if I stored my mail
in an RDBMS - not, at least, and still have acceptable performance in the
RDBMS.

--
Craig Ringer
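
To make the scanning cost concrete, here is a minimal, untested Python
sketch of a manifest-based incremental scan. It is not how Bacula's
director and file daemon actually talk to each other - the
incremental_scan name and the {path: mtime} JSON manifest are made up for
the example - but it shows why every file still has to be stat()ed, and why
deletion tracking forces a comparison against the entire previous file list.

# Minimal sketch of a manifest-based incremental scan, assuming a previous
# backup left a {path: mtime} JSON manifest on disk. Illustrative only -
# not Bacula's actual catalog or wire protocol.
import json
import os

def incremental_scan(root: str, manifest_path: str):
    try:
        with open(manifest_path) as f:
            previous = json.load(f)      # {path: mtime} from the last run
    except FileNotFoundError:
        previous = {}                    # first run behaves like a level 0

    changed, seen = [], set()

    # Every file must be stat()ed even if its directory looks untouched:
    # a directory's mtime only changes when entries are added or removed,
    # not when a file inside it is modified.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            seen.add(path)
            mtime = os.stat(path).st_mtime
            if previous.get(path) != mtime:
                changed.append(path)     # new or altered since last backup
            previous[path] = mtime

    # Deletion tracking is the expensive part: vanished files can only be
    # discovered by checking the whole old manifest against this walk.
    deleted = [p for p in previous if p not in seen]
    for p in deleted:
        del previous[p]

    with open(manifest_path, "w") as f:
        json.dump(previous, f)
    return changed, deleted

Even this toy version has to walk and stat the full tree on every
incremental, which is roughly the work Craig describes the file daemon
doing against millions of small files.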