Thorsten:

On Wed, Feb 10, 2021 at 9:58 AM Thorsten Schöning <tschoening@xxxxxxxxxx> wrote:
...
> I've changed the new table "datagram" to be a partitioned one with
> partitions containing rows per year and per half-year. Each partition
> contains far fewer rows than before this way, and while I only tested
> with "--compress=9" this time, the numbers are quite interesting:
>
> > unpartitioned:     6,4 GiB
> > half-yearly parts: 4,8 GiB
> > yearly parts:      4,8 GiB
>
> The interesting part this time is that the table layout for all
> partitions is the same as before, only the number of rows per table
> is different. The overall number of rows is also the same as before,
> with the same data, IDs etc. Yet this time the dump really is smaller
> than with the OLD schema, which contained far more data because of
> duplicate IDs and such.

I wouldn't have expected the table layout to be that important.
Compression depends on detectable redundancy in the input, and pg_dump
more or less gzips per-table "COPY ... TO STDOUT" output. If your data
evolves in a predictable way, having it sorted by ingestion time may
increase the detectable redundancy a lot, and partitioning sorts the
data partially by date (or fully, if you populated the partitions by
range-querying via an index scan). So it may not be the layout that
matters here, but the row order.

Since you seem to be able to test, you could try sorting the full,
unpartitioned table by the column you use for partitioning and dumping
it again. IIRC CLUSTER will do the trick if that column is indexed.

(This has happened to me when compressing document dumps: presorting by
some chosen fields improved my compression ratio a bit. IIRC that was
because the compressor searches for duplicates only within a limited
window, which is why I say "detectable" redundancy.)

Francisco Olarte.
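
P.S. A minimal sketch of that test, in case it helps. The table and
column names below are placeholders, not from your schema: I'm assuming
an unpartitioned copy of the table, here called "datagram_flat", and a
timestamp column "created_at" as the partitioning key.

    -- CLUSTER needs an index on the sort column; create one if missing.
    CREATE INDEX IF NOT EXISTS datagram_flat_created_at_idx
        ON datagram_flat (created_at);

    -- Physically rewrite the table in index order.
    CLUSTER datagram_flat USING datagram_flat_created_at_idx;

Then dump just that table again with the same compression level, e.g.
something like "pg_dump --compress=9 -t datagram_flat -f
datagram_sorted.gz yourdb", and compare sizes. If the row order is what
matters, the sorted dump should come out closer to the 4,8 GiB of the
partitioned layout than to the 6,4 GiB of the unsorted one.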