Greetings,

I'm running PostgreSQL 9.0.4 on a Linux x86_64 cluster with one master and two streaming-replication slaves. Since late yesterday, the load on the server has been noticeably higher (5.00+) than normal (generally under 1.00). I investigated, and found that for the past ~18 hours there has been one autovacuum process running without making any obvious progress:

select procpid, query_start, current_query
from pg_stat_activity
where current_query LIKE 'autovacuum%';

 procpid |          query_start          |              current_query
---------+-------------------------------+------------------------------------------
   30188 | 2011-11-21 22:42:26.426315-08 | autovacuum: VACUUM public.nppsmoketests
(1 row)

select c.oid, c.relname, l.pid, l.mode, l.granted
from pg_class c
join pg_locks l on c.oid = l.relation
where l.pid = '30188'
order by l.pid;

  oid  |              relname               |  pid  |           mode           | granted
-------+------------------------------------+-------+--------------------------+---------
 72112 | nppsmoketests                      | 30188 | ShareUpdateExclusiveLock | t
 72617 | nppsmoketests_pkey                 | 30188 | RowExclusiveLock         | t
 72619 | nppsmoketests_bug_idx              | 30188 | RowExclusiveLock         | t
 72620 | nppsmoketests_active_idx           | 30188 | RowExclusiveLock         | t
 72621 | nppsmoketests_arch_idx             | 30188 | RowExclusiveLock         | t
 72622 | nppsmoketests_branch_idx           | 30188 | RowExclusiveLock         | t
 72623 | nppsmoketests_current_status_idx   | 30188 | RowExclusiveLock         | t
 72624 | nppsmoketests_build_type_idx       | 30188 | RowExclusiveLock         | t
 72625 | nppsmoketests_gpu_idx              | 30188 | RowExclusiveLock         | t
 72626 | nppsmoketests_os_idx               | 30188 | RowExclusiveLock         | t
 72627 | nppsmoketests_owner_idx            | 30188 | RowExclusiveLock         | t
 72628 | nppsmoketests_regressioncl_idx     | 30188 | RowExclusiveLock         | t
 72629 | nppsmoketests_subtest_idx          | 30188 | RowExclusiveLock         | t
 72630 | nppsmoketests_suiteid_idx          | 30188 | RowExclusiveLock         | t
 72631 | nppsmoketests_suiteid_testname_idx | 30188 | RowExclusiveLock         | t
 72632 | nppsmoketests_testcl_idx           | 30188 | RowExclusiveLock         | t
 72633 | nppsmoketests_testname_idx         | 30188 | RowExclusiveLock         | t
 80749 | nppsmoketests_osversion_idx        | 30188 | RowExclusiveLock         | t
(18 rows)

When I strace PID 30188, I see tons of this scrolling past quickly, but I'm not really sure what it means, beyond a 'Timeout' not looking good:

select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 753901568, SEEK_SET) = 753901568
read(95, "\202\1\0\0\260\315\250\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
lseek(95, 753917952, SEEK_SET) = 753917952
read(95, "\202\1\0\0 N\253\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 768606208, SEEK_SET) = 768606208
read(95, "\204\1\0\0h!~\233\1\0\0\0\230\0\360\20\360\37\4 \0\0\0\0x\237\360\0\0\237\360\0"..., 8192) = 8192
lseek(95, 753934336, SEEK_SET) = 753934336
read(95, "\202\1\0\0 &\275\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
read(95, "\202\1\0\0\10\33\276\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
lseek(95, 753958912, SEEK_SET) = 753958912
read(95, "\202\1\0\0x\317\307\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 768614400, SEEK_SET) = 768614400

An old thread suggests that this pattern can indicate a stuck spinlock:
http://archives.postgresql.org/pgsql-performance/2009-05/msg00455.php
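For what it's worth, the select() calls above each sleep for 32 ms, so my (uninformed) guess is that this might just be cost-based vacuum throttling rather than a true hang. Here is what I was planning to check next; the settings should all be at their defaults (see below), and the table name is the one from the lock listing:

-- cost-based vacuum throttling settings (all defaults here):
SHOW autovacuum_vacuum_cost_delay;
SHOW autovacuum_vacuum_cost_limit;
SHOW vacuum_cost_delay;

-- total size of the table plus its 17 indexes, to gauge how much
-- data the vacuum has to work through:
SELECT pg_size_pretty(pg_total_relation_size('nppsmoketests'));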
I'm using the defaults for all the *vacuum* options in postgresql.conf, except for:

log_autovacuum_min_duration = 2500

At this point, I'm not sure what I can safely do to either terminate this autovacuum process or kick it into making progress again. Any suggestions?

thanks
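P.S. My best guess at a safe way to stop it, if I have to, is to cancel the backend from SQL rather than kill it from the shell, along these lines (untested on my end, so please correct me if this is the wrong approach for an autovacuum worker):

-- ask backend 30188 (the autovacuum worker) to cancel its current
-- operation; the autovacuum launcher will presumably just pick the
-- table up again on a later pass
SELECT pg_cancel_backend(30188);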