significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

Lonni J Friedman <netllama@xxxxxxxxx> · Mon, 21 May 2012 14:05:51 -0700

Greetings,
I have a 4 server postgresql-9.1.3 cluster (one master doing streaming
replication to 3 hot standby servers).  All of them are running
Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.  I
made no changes to postgresql.conf following the upgrade.  I used
pg_upgrade on the master to upgrade it, followed by blowing away
$PGDATA on all the standbys and rsyncing them fresh from the master.
All of the servers have 128GB RAM, and at least 16 CPU cores.

Everything appeared to be working fine until last night when the load
on the master suddenly took off, and hovered at around 30.00 ever
since.  Prior to the load spike, the load was hovering around 2.00
(which is actually lower than it was averaging prior to the upgrade
when it was often around 4.00).  When I got in this morning, I found
an autovacuum process that had been running since just before the load
spiked, and the pg_dump cronjob that started shortly after the load
spike (and normally completes in about 20 minutes for all the
databases) was still running, and hadn't finished the first of the 6
databases.  I ended up killing the pg_dump process altogether in the
hope that it might unblock whatever was causing the high load.
Unfortunately that didn't help, and the load continued to run high.

I proceeded to check dmesg, /var/log/messages and the postgresql
server log (all on the master), but I didn't spot anything out of the
ordinary, definitely nothing that pointed to a potential explanation
for all of the high load.

I inspected what the autovacuum process was doing, and determined that
it was chewing away on the largest table (nearly 98 million rows) in
the largest database.  It was making very slow progress, at least I
believe that was the case, as when I attached strace to the process,
the seek addresses were changing in a random fashion.

Here are the current autovacuum settings:
autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 200000000
autovacuum_max_workers          | 4
autovacuum_naptime              | 1min
autovacuum_vacuum_cost_delay    | 20ms
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50

Did something significant change in 9.1 that would impact autovacuum
behavior?  I'm at a complete loss on how to debug this, since I'm
using the exact same settings now as prior to the upgrade.

thanks

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general