On Mon, 15 Apr 2013 19:54:15 -0700
Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

> On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras <emorrasg@xxxxxxxx> wrote:
>
> > On Mon, 8 Apr 2013 10:40:16 -0500
> > Shaun Thomas <sthomas@xxxxxxxxxxxxxxxx> wrote:
> >
> > > Anyone else?
> >
> > If his db has low inserts/updates/deletes he can use diff between
> > pg_dumps (with the default -Fp) before compressing.
>
> Most "diff" implementations read the entirety of both files into
> memory, so they may not work well with 200GB of data unless it is
> broken into a large number of much smaller files.
>
> open-vcdiff only reads one of the files into memory, but I couldn't
> really figure out what happens memory-wise when you try to undo the
> resulting patch; the documentation is a bit mysterious.
>
> xdelta3 will "work" on streamed files of unlimited size, but it
> doesn't work very well unless the files fit in memory, or have the
> analogous data in the same order in the two files.

For my 12-13 GB dump files I use:

  git diff -p 1.sql 2.sql > diff.patch

It uses 4MB in the first phase and up to 140MB in the last one, and it
produces a patch file from which the new dump can be recovered with:

  patch -o 2.sql 1.sql < diff.patch

or with git apply. (A fuller sketch of this workflow is appended at the
end of this message.)

> A while ago I did some attempts to "co-compress" dump files, based on
> the notion that the pg_dump text format does not have \n within
> records, so it is sortable as ordinary text, and that tables usually
> have their "stable" columns, like a pk, near the beginning of the
> table and volatile columns near the end. Sorting the lines of several
> dump files together therefore gathers replicate or near-replicate
> lines together, where ordinary compression algorithms can work their
> magic. So if you tag each line with its line number and which file it
> originally came from, then sort the lines (skipping the tag), you get
> much better compression. But not nearly as good as open-vcdiff,
> assuming you have the RAM to spare.
>
> Using two dumps taken months apart on a slowly-changing database, it
> worked fairly well:
>
> cat 1.sql | pigz | wc -c
> 329833147
>
> cat 2.sql | pigz | wc -c
> 353716759
>
> cat 1.sql 2.sql | pigz | wc -c
> 683548147
>
> sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) \
>          <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
> 436350774
>
> A given file can then be recovered with, for example:
>
> zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2

Be careful: some z* utilities (zdiff, for example) decompress the whole
file to /tmp.

> There are all kinds of short-comings here, of course; it was just a
> quick and dirty proof of concept.

A nice one!

> For now I think storage is cheap enough for what I need to do to make
> this not worth fleshing it out any more.
>
> Cheers,
>
> Jeff

---
Eduardo Morras <emorrasg@xxxxxxxx>
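
A minimal end-to-end sketch of the diff-based backup described above,
assuming GNU patch and plain-text (-Fp) dumps; the database and file
names are only illustrative:

  # Take today's plain-text dump (names are illustrative).
  pg_dump -Fp mydb > dump_new.sql

  # Keep only the delta against the previous dump; --no-index makes
  # git diff compare plain files even inside a repository.
  git diff --no-index -p dump_old.sql dump_new.sql > dump_new.patch
  gzip dump_new.patch

  # Later: rebuild the new dump from the old dump plus the delta.
  gunzip dump_new.patch.gz
  patch -o dump_new.restored.sql dump_old.sql < dump_new.patch

  # Sanity check before discarding the full dump.
  cmp dump_new.sql dump_new.restored.sql

Note that git diff --no-index exits with status 1 when the files
differ, which matters if the command runs in a script under set -e.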
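
For the xdelta3 remark above, the basic encode/decode calls look like
this (a sketch only, with made-up file names; memory behaviour is as
described in the quoted text):

  # Encode: produce a VCDIFF delta of the new dump against the old one.
  xdelta3 -e -s dump_old.sql dump_new.sql dump_new.xd3

  # Decode: reconstruct the new dump from the old dump plus the delta.
  xdelta3 -d -s dump_old.sql dump_new.xd3 dump_new.restored.sql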
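
For completeness, the counterpart of the co-compression example quoted
above: building the group-compressed file and recovering the first dump
from it. This only restates the same tagging scheme, so the tag format
(line number plus "a" or "b") is taken from the quoted pipeline:

  # Tag each line with its number and source file, sort on the content,
  # and compress the interleaved result.
  sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) \
           <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz > group_compressed.gz

  # Recover the first dump: keep only the "a"-tagged lines, strip the tag.
  zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+a\t)// and print' > 1.sql2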
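
On the /tmp caveat: with bash, process substitution avoids the
decompressed temporary copies that zdiff and friends may create (diff
still reads both streams into memory, as noted above):

  # Compare two compressed dumps without writing decompressed copies to /tmp.
  # The patch header will show /dev/fd paths; patch with an explicit
  # target file (patch -o NEW OLD < diff.patch) ignores them anyway.
  diff -u <(zcat 1.sql.gz) <(zcat 2.sql.gz) > diff.patch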