On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras <emorrasg@xxxxxxxx> wrote:
> On Mon, 8 Apr 2013 10:40:16 -0500
> Shaun Thomas <sthomas@xxxxxxxxxxxxxxxx> wrote:
>>
>> Anyone else?
>>
> If his db has low inserts/updates/deletes he can use diff between pg_dumps (with default -Fp) before compressing.
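For concreteness, that suggestion would look something like this (filenames are hypothetical, and this assumes GNU diff and patch):

pg_dump -Fp mydb > dump_new.sql              # plain-text dump
diff dump_old.sql dump_new.sql > delta.diff  # delta against the previous dump
gzip delta.diff                              # store only the compressed delta
# to rebuild the newer dump later:
zcat delta.diff.gz > delta.diff
patch -o dump_new.sql dump_old.sql delta.diff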
Most "diff" implementations will read the entirety of both files into memory, so may not work well with 200GB of data, unless it is broken into a large number of much smaller files.
open-vcdiff only reads one of the files into memory, but I couldn't really figure out what happens memory-wise when you apply the resulting patch; the documentation is a bit mysterious.
xdelta3 will "work" on streamed files of unlimited size, but it doesn't work very well unless the files fit in memory, or have the analogous data in the same order between the two files.
A while ago I made some attempts to "co-compress" dump files. The idea is that the pg_dump text format has no \n within records, so a dump is sortable as ordinary text, and that tables usually keep their "stable" columns, like a pk, near the beginning of each row and their volatile columns near the end. Sorting the lines of several dump files together therefore gathers replicate or near-replicate lines next to each other, where ordinary compression algorithms can work their magic.

So if you tag each line with its line number and the file it originally came from, then sort the lines (skipping the tag), you get much better compression than just concatenating the files, though still not nearly as good as open-vcdiff, assuming you have the RAM to spare for that.
Using two dumps taken months apart on a slowly-changing database, it worked fairly well:
cat 1.sql | pigz | wc -c
329833147
cat 2.sql | pigz | wc -c
353716759
cat 1.sql 2.sql | pigz | wc -c
683548147
sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
436350774
A given file can then be recovered from the sorted, compressed output (saved here as group_compressed.gz), for example:
zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2
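And a quick sanity check that the recovered file is byte-identical to the original:

cmp 2.sql 2.sql2 && echo "round trip OK"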
There are all kinds of shortcomings here, of course; it was just a quick and dirty proof of concept.
For now, I think storage is cheap enough for what I need to do that it isn't worth fleshing this out any further.
Cheers,
Jeff