On Mon, Mar 7, 2011 at 7:28 AM, chris r. <chricki@xxxxxxx> wrote:
> Merlin, first of all, thanks for your reply!
>
>> hm. where exactly is all this time getting spent? Are you i/o bound?
>> cpu bound? Is there any compression going on?
> Very good questions. pg_dump -F c compresses by default "at a moderate
> level" (manpage), whatever compression level 'moderate' actually means.
> So yes, without explicitly asking for it, we use compression.
>
> For testing, I inserted a fraction of our huge table with bytea content
> into the table 'testtable'. The next three outputs compare pg_dump for
> this table with the default compression level, no compression, and
> low-level compression (level 3). The time spent seems CPU-bound: in the
> first test case, 90-100% of a CPU core is used the whole time.
>
> (default compression)
> time pg_dump -f /tmp/test.sql -F c -t testtable mydb
> real    0m27.255s
> user    0m26.383s
> sys     0m0.180s
>
> (low-level compression)
> time pg_dump -f /tmp/test.sql -F c -Z 3 -t testtable mydb
> real    0m8.883s
> user    0m8.112s
> sys     0m0.161s
>
> (no compression)
> time pg_dump -f /tmp/test.sql -F c -Z 0 -t testtable mydb
> real    0m1.892s
> user    0m0.074s
> sys     0m0.279s
>
> To summarize, in our scenario, moderate-level compression caused a
> factor-14 slowdown.

right -- well, in the short term it looks like you should consider
lowering or disabling compression.

> In another test, I compared pg_dump on a table with textual content
> that I naively created with:
>
> select (t/23.0)::text||(t/17.0)::text
> into testtable
> from generate_series(1, 1000000) t;
>
> Very much to my surprise, dumping this table did not show such a huge
> difference when using compression: a default-compressed pg_dump took
> 2.4s, whereas an uncompressed pg_dump took 2.0s (merely a factor of
> 1.2x). However, when expanding the series to 3 million rows (instead of
> 1 million), the compressed pg_dump took 7.0s, whereas an uncompressed
> pg_dump ran for only 2.4s (factor 3x). Does this show that compression
> takes relatively longer the more data it needs to compress? Memory
> consumption was less than 12 MB during testing.

Most compression algs don't use a lot of memory. Also, as a general
rule of thumb, low-entropy data compresses much faster than
high-entropy data, so you can't really compare synthetic tests like
that to real-world data, as you discovered.

Unfortunately, compression is something of a weak point for the
postgres project: there are much better bang/buck ratio algorithms out
there that we can't use because of licensing or patent concerns.
There are a lot of easy workarounds though (like rigging a command-line
compressor to run on the dump), so it isn't really a big deal for
backups. You may also want to investigate whether your bytea columns
are being TOAST-compressed and look there if you are having
performance issues.
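As a rough sketch of the external-compressor idea (untested here;
pigz is just one example of a command-line compressor, and the output
path is made up):

pg_dump -F c -Z 0 mydb | pigz > /tmp/mydb.dump.gz

And a quick way to check whether a column's storage even allows TOAST
compression (again only a sketch): attstorage 'x' means extended
storage (out of line, compression allowed), 'e' means external (out of
line, never compressed):

select attname, attstorage
from pg_attribute
where attrelid = 'testtable'::regclass and attnum > 0;

Comparing pg_column_size(col) against octet_length(col) on a few rows
would then show whether the values are actually stored compressed.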
>> Maybe this is a
>> performance issue inside pg_dump itself, not necessarily a text/binary
>> issue (i have a hard time believing going from b64->hex is 10x slower
>> on format basis alone). Can you post times comparing manual COPY via
>> text, manual COPY via binary, and pg_dump -F c?
> Again, valid points. As a next step, I compared the COPY variants.
>
> time psql mydb -c "COPY testtable TO '/tmp/test.sql' WITH (FORMAT 'text');"
> real    0m1.712s
> user    0m0.001s
> sys     0m0.004s
>
> In text format, the time it takes to COPY testtable is very much like
> the time it takes to run pg_dump without compression.
>
> Interestingly, COPYing testtable in binary format gives another factor
> 3.6x speedup:
>
> time psql mydb -c "COPY testtable TO '/tmp/test.sql' WITH (FORMAT
> 'binary');"
> real    0m0.470s
> user    0m0.000s
> sys     0m0.005s
>
> As one might argue the table was too small to compare runtimes, I
> repeated this second comparison with two larger tables - both times
> showing a 5x-6x speedup with binary format! In either format the
> operation seemed CPU-bound (> 95% of a core was taken).
>
> To summarize, I could speed up my backup by removing compression
> (factor 14) and by using COPY in binary format instead of pg_dump
> (factor 5 to 6). However, only the first option would keep data
> integrity. To have an easy, integrity-safe backup, IMHO, the second
> option can only be achieved by adding a switch to pg_dump that allows
> binary output.

Well, that's a pretty telling case, although I'd venture to say not a
typical one. In average databases, I'd expect improvement in the
10-50% range going from text->binary, which is often not enough to
justify the compatibility issues. Does it justify a 'binary' switch to
pg_dump? I'd say so -- as long as the changes required aren't too
extensive (although you can expect disagreement on that point).

hm. i'll take a look...

merlin

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general