I'd say to do some monitoring of your machine while this is happening:
vmstat, iostat, iotop, htop, and so on.  Are you running out of memory?
Seeing a context switch / interrupt storm?  IO bound?  And so on.

On Tue, May 22, 2012 at 12:20 PM, Lonni J Friedman <netllama@xxxxxxxxx> wrote:
> No one has any ideas or suggestions, or even questions?  If someone
> needs more information, I'd be happy to provide it.
>
> This problem is absolutely killing me.
>
> On Mon, May 21, 2012 at 2:05 PM, Lonni J Friedman <netllama@xxxxxxxxx> wrote:
>> Greetings,
>> I have a 4-server postgresql-9.1.3 cluster (one master doing streaming
>> replication to 3 hot standby servers).  All of them are running
>> Fedora-16-x86_64.  Last Friday I upgraded the entire cluster from
>> Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3.  I
>> made no changes to postgresql.conf following the upgrade.  I used
>> pg_upgrade on the master to upgrade it, followed by blowing away
>> $PGDATA on all the standbys and rsyncing them fresh from the master.
>> All of the servers have 128GB RAM and at least 16 CPU cores.
>>
>> Everything appeared to be working fine until last night, when the load
>> on the master suddenly took off and has hovered at around 30.00 ever
>> since.  Prior to the spike, the load was hovering around 2.00 (which
>> is actually lower than the average prior to the upgrade, when it was
>> often around 4.00).  When I got in this morning, I found an autovacuum
>> process that had been running since just before the load spiked, and
>> the pg_dump cronjob that started shortly after the load spike (and
>> normally completes in about 20 minutes for all the databases) was
>> still running and hadn't finished the first of the 6 databases.  I
>> ended up killing the pg_dump process altogether in the hope that it
>> might unblock whatever was causing the high load.  Unfortunately that
>> didn't help, and the load has remained high.
>>
>> I proceeded to check dmesg, /var/log/messages and the postgresql
>> server log (all on the master), but I didn't spot anything out of the
>> ordinary, definitely nothing that pointed to a potential explanation
>> for all of the high load.
>>
>> I inspected what the autovacuum process was doing, and determined that
>> it was chewing away on the largest table (nearly 98 million rows) in
>> the largest database.  It was making very slow progress, at least I
>> believe that was the case, as when I attached strace to the process,
>> the seek addresses were changing in a random fashion.
>>
>> Here are the current autovacuum settings:
>> autovacuum                      | on
>> autovacuum_analyze_scale_factor | 0.1
>> autovacuum_analyze_threshold    | 50
>> autovacuum_freeze_max_age       | 200000000
>> autovacuum_max_workers          | 4
>> autovacuum_naptime              | 1min
>> autovacuum_vacuum_cost_delay    | 20ms
>> autovacuum_vacuum_cost_limit    | -1
>> autovacuum_vacuum_scale_factor  | 0.2
>> autovacuum_vacuum_threshold     | 50
>>
>> Did something significant change in 9.1 that would impact autovacuum
>> behavior?  I'm at a complete loss on how to debug this, since I'm
>> using the exact same settings now as prior to the upgrade.
>>
>> thanks
>
> --
> Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

--
To understand recursion, one must first understand recursion.
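[Editor's note] The "context switch / interrupt storm" check suggested at the top of the thread can be sketched without vmstat by sampling /proc/stat directly. This is a minimal sketch assuming a Linux /proc filesystem; the one-second sampling interval is illustrative.

```shell
# Sample total context switches (ctxt) and interrupts (intr) from
# /proc/stat one second apart, then print the per-second rates.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)   # cumulative context switches
i1=$(awk '/^intr/ {print $2}' /proc/stat)   # cumulative interrupts (first field is the total)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
i2=$(awk '/^intr/ {print $2}' /proc/stat)
echo "ctxt/s=$((c2 - c1)) intr/s=$((i2 - i1))"
```

A sustained rate orders of magnitude above the machine's baseline would point toward the storm scenario; `vmstat 1` reports the same counters in its `cs` and `in` columns.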
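[Editor's note] For context on why autovacuum visits the 98-million-row table so rarely (and therefore has so much to do when it runs), the trigger point can be worked out from the settings quoted above. The row count is the approximate figure from the thread; integer arithmetic stands in for scale_factor = 0.2.

```shell
# Dead tuples needed before autovacuum triggers on the big table:
#   trigger = autovacuum_vacuum_threshold + scale_factor * reltuples
reltuples=98000000   # ~98 million rows, per the thread
threshold=50         # autovacuum_vacuum_threshold
trigger=$((threshold + reltuples / 5))   # scale_factor 0.2 == reltuples/5
echo "autovacuum fires after ~$trigger dead tuples"
```

So roughly 19.6 million rows must change before a vacuum starts, which is consistent with a single long, heavy autovacuum pass rather than frequent small ones.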