On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras <emorrasg@xxxxxxxx> wrote:
> On Mon, 8 Apr 2013 10:40:16 -0500
> Shaun Thomas <sthomas@xxxxxxxxxxxxxxxx> wrote:
>>
>> Anyone else?
>>
> If his db has low inserts/updates/deletes he can use diff between pg_dumps (with default -Fp) before compressing.
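For concreteness, that suggestion would look something like this (filenames are hypothetical, and this assumes GNU diff and patch):

pg_dump -Fp mydb > dump_new.sql              # plain-text dump
diff dump_old.sql dump_new.sql > delta.diff  # delta against the previous dump
gzip delta.diff                              # store only the compressed delta
# to rebuild the newer dump later:
zcat delta.diff.gz > delta.diff
patch -o dump_new.sql dump_old.sql delta.diff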
Most "diff" implementations will read the entirety of both files into memory, so may not work well with 200GB of data, unless it is broken into a large number of much smaller files.
open-vcdiff only reads one of the files into memory, but I couldn't really figure out what happens memory-wise when you apply the resulting patch; the documentation is a bit mysterious.
xdelta3 will "work" on streamed files of unlimited size, but it doesn't work very well unless the files fit in memory, or have the analogous data in the same order between the two files.
A while ago I made some attempts to "co-compress" dump files. The idea is that the pg_dump text format has no \n within records, so a dump is sortable as ordinary text, and that tables usually keep their "stable" columns, like a pk, near the beginning of each row and their volatile columns near the end. Sorting the lines of several dump files together therefore gathers replicate or near-replicate lines next to each other, where ordinary compression algorithms can work their magic.

So if you tag each line with its line number and the file it originally came from, then sort the lines (skipping the tag), you get much better compression than just concatenating the files, though still not nearly as good as open-vcdiff, assuming you have the RAM to spare for that.
Using two dumps taken months apart on a slowly-changing database, it worked fairly well:
cat 1.sql | pigz | wc -c
329833147
cat 2.sql | pigz | wc -c
353716759
cat 1.sql 2.sql | pigz | wc -c
683548147
sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
436350774
A given file can then be recovered from the sorted, compressed output (saved here as group_compressed.gz), for example:
zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2
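And a quick sanity check that the recovered file is byte-identical to the original:

cmp 2.sql 2.sql2 && echo "round trip OK"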
There are all kinds of shortcomings here, of course; it was just a quick and dirty proof of concept.
For now, I think storage is cheap enough for what I need to do that it isn't worth fleshing this out any further.
Cheers,
Jeff