Thorsten:

On Wed, Feb 10, 2021 at 12:36 PM Thorsten Schöning <tschoening@xxxxxxxxxx> wrote:
> Good day Francisco Olarte,
> on Wednesday, 10 February 2021 at 10:38 you wrote:

> > [...]and partitioning sorts partially
> > by date ( or fully if you have made the partitions by range-querying
> > via index scan ).[...]

> That statement is especially interesting not only because of my
> dump size, but because I'm running into the problem that queries
> spanning more than one partition seem to prefer sequential scans
> over using indexes. My index seems to only be used when querying
> the rows of one partition.

You know the drill: show your indexes and post some EXPLAIN output
for those queries. The statements above are too fuzzy to infer any
useful advice from.

> So, the following is my definition, should that be "range-queried via
> index scan" properly? :-)

> > CREATE TABLE datagram
> > (
> >     id bigserial NOT NULL,
> >     src_re integer NOT NULL,
> >     src_clt integer NOT NULL,
> >     src_meter integer NOT NULL,
> >     captured_at timestamp with time zone NOT NULL,
> >     captured_rssi smallint NOT NULL,
> >     oms_status smallint NOT NULL,
> >     oms_enc bytea,
> >     oms_dec bytea
> > ) PARTITION BY RANGE (captured_at);

> > CREATE TABLE datagram_y1970_h1 PARTITION OF datagram FOR VALUES FROM ('1970-01-01') TO ('1970-07-01');
> > CREATE TABLE datagram_y1970_h2 PARTITION OF datagram FOR VALUES FROM ('1970-07-01') TO ('1971-01-01');
> > [...]

Given that definition no index scan is possible, as there are no
indexes ( sketch 1 in the P.S. shows how to declare one on the
partitioned table ).

> > INSERT INTO datagram([...]) SELECT * FROM datagram_unpart;

But given this, what I was suggesting does not apply. What I meant is
that if you had a big datagram_unpart table indexed by captured_at,
with good index correlation, which given the name of the field would
be expected, and populated the partitions one by one by doing range
queries over datagram_unpart, you might have hit index scans on each
population query and ended up with perfectly sorted partitions, and
that could help compression ( sketch 2 below ).

Given your query, unless the optimizer is performing some
supernatural ( for me ) trickery, my bet would be on a full table
scan plus insertion, which would leave you with whatever order you
had originally in each partition. But if your data, or your "data
shape", has some natural correlation with the captured timestamps
( e.g., I know that my CDR data has it ), putting similar data
together could have improved your compression ratios. Bear in mind I
do not know whether it does.

Anyway, I would not worry too much about the backup ratios with the
sizes you have. I would worry more about things like raising fill
factors ( sketch 3 below ) and similar tweaks, if your data
processing is similar to mine ( data enters at time captured_at, is
sometimes fixed due to errors, but normally never moves after being
in the tables for a couple of months, and when it does it is a
special problem which can be handled manually ). It will not improve
the backup too much, but it can improve other things.

E.g., in my big CDR ( that means call detail record, telephony )
tables I let rows go into "normal" monthly partitions, but once they
are a few months old I coalesce them into fully sorted, unindexed,
100% filled yearly partitions ( sketch 4 below ). Those partitions
are then put into an "historic" schema which I do NOT back up
automatically; I only do that after the ( manually fired ) coalescing
and the ( manually done ) very occasional long-term fixups ( I rarely
have to touch anything older than a month ).

Regards,
Francisco Olarte.
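
P.S.: a few untested sketches of the points above. Anything not taken
from your mail ( the index names, the "historic" schema ) is made up,
so adapt as needed.

1. Since PostgreSQL 11 you can declare an index once on the
partitioned table and it is created on every present and future
partition; EXPLAIN then tells you whether a query spanning several
partitions actually uses it:

    -- One declaration covers all partitions.
    CREATE INDEX datagram_captured_at_idx ON datagram (captured_at);

    -- See what the planner does for a query spanning two partitions.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM datagram
    WHERE captured_at >= '1970-05-01' AND captured_at < '1970-09-01';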
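
2. What I meant by "range-querying": populate each partition with its
own range query over the indexed source table; an explicit ORDER BY
makes the rows arrive sorted even if the planner picks a sequential
scan ( this assumes datagram_unpart has the same column order as
datagram, otherwise list the columns ):

    -- Repeat once per partition range; rows land physically sorted.
    INSERT INTO datagram
    SELECT * FROM datagram_unpart
    WHERE captured_at >= '1970-01-01' AND captured_at < '1970-07-01'
    ORDER BY captured_at;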
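
3. Fill factors: for rows that never move once settled you can pack
pages completely. Heap pages already default to fillfactor 100, but
b-tree index pages default to 90. The setting only affects newly
written pages, so set it before loading or rewrite afterwards ( the
index name below is the auto-generated one, use whatever yours is
called ):

    ALTER TABLE datagram_y1970_h1 SET (fillfactor = 100);
    ALTER INDEX datagram_y1970_h1_captured_at_idx SET (fillfactor = 100);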
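
4. The coalescing into sorted yearly partitions, sketched here on
your two half-year partitions; my real ( monthly ) procedure has more
checks and is fired by hand. Note that if the parent carries a
partitioned index, the ATTACH will build it on the new partition, so
it will not stay unindexed as mine do:

    BEGIN;
    -- New yearly table: no indexes, fully packed pages.
    CREATE TABLE historic.datagram_y1970
        (LIKE datagram) WITH (fillfactor = 100);
    -- Rewrite the whole year fully sorted.
    INSERT INTO historic.datagram_y1970
    SELECT * FROM datagram
    WHERE captured_at >= '1970-01-01' AND captured_at < '1971-01-01'
    ORDER BY captured_at;
    -- Swap: drop the two half-year partitions, attach the yearly one.
    DROP TABLE datagram_y1970_h1, datagram_y1970_h2;
    ALTER TABLE datagram ATTACH PARTITION historic.datagram_y1970
        FOR VALUES FROM ('1970-01-01') TO ('1971-01-01');
    COMMIT;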