Re: Autovacuum seems to block database: WARNING worker took too long to start

Tom Lane <tgl@xxxxxxxxxxxxx> · Sun, 14 Nov 2010 14:15:28 -0500

Pablo Delgado Díaz-Pache <delgadop@xxxxxxxxx> writes:
> I'm having a strange problem with postgres & autovacuum
> Everything is working fine until I start getting the following errors ...
> and postgres stops working shortly after (it stops accepting connections)

> 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08
> CET|1/44303|0|| LOG:  automatic vacuum of table
> "mrs.pg_catalog.pg_statistic": index scans: 1
>         pages: 0 removed, 189 remain
>         tuples: 132 removed, 4587 remain
>         system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled

Hm.  The code comment above that warning says

             * The only problems that may cause this code to
             * fire are errors in the earlier sections of AutoVacWorkerMain,
             * before the worker removes the WorkerInfo from the
             * startingWorker pointer.

but it's hard to see what problem there could lead to an issue.  (In
particular, I discount the idea that AutovacuumLock could be stuck,
because we had to acquire it in order to issue this message.)

But it strikes me that the code comment is wrong in one significant way:
if the postmaster were failing to heed SIGUSR1 at all, it would never
fork the requested worker, and you could reach the timeout here, because
the fork-failed signal wouldn't get sent either.

Given that you say it also stops accepting connections, I'm thinking
this is a postmaster problem not an autovacuum problem.  But you've
not provided any information about that end of it.  Exactly what happens
when you try to make a connection?  Are there any entries at all in the
postmaster log?  What about the kernel log?  Are you sure that new
connections stop working *after* this happens, and not at exactly the
same time?
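
One way to settle the timing question is to pull the timestamps of the
warnings out of the server log and compare them with the time of the
first failed connection attempt. A minimal Python sketch, assuming each
log entry begins with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp as in the
excerpt quoted above; the names warning_times and gaps_seconds are made
up for illustration:

```python
import re
from datetime import datetime

# Leading "YYYY-MM-DD HH:MM:SS.mmm" timestamp, as seen in the quoted excerpt.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")

def warning_times(lines, needle="worker took too long to start"):
    """Return the timestamps of log entries whose text contains `needle`.

    An entry may wrap onto a following line (as it does in the mail), so
    remember the most recent timestamp and attribute the message to it.
    """
    times, current = [], None
    for line in lines:
        m = TS_RE.match(line)
        if m:
            current = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if needle in line and current is not None:
            times.append(current)
    return times

def gaps_seconds(times):
    """Seconds elapsed between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Fed the three warnings quoted above (with the "> " quoting stripped),
this reports gaps of roughly 60 seconds between them. The same function
with a different needle can locate the last entry logged before
connections began to fail.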

> OS: Centos 5.5
> Kernel: 2.6.18-194.26.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)

Given that this is a Linux system, I think that an OOM kill on the
postmaster is a not-insignificant possibility.  Or at least I would
think that if there weren't a PostmasterIsAlive check in the autovac
launcher loop.  It's real hard to see how you could get more than
one of these messages if the postmaster were gone entirely.
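
The OOM theory can be checked against the kernel log. A rough Python
sketch follows; the message patterns are an assumption (the exact
wording of OOM-killer messages varies between kernel versions), and
find_oom_lines is a hypothetical helper, not part of any existing tool:

```python
import re

# Assumed patterns: OOM-killer wording differs between kernel versions.
OOM_RE = re.compile(r"out of memory|oom-killer", re.IGNORECASE)

def find_oom_lines(lines, process_names=("postmaster", "postgres")):
    """Split OOM-related log lines into (named, other).

    `named` holds lines that mention one of `process_names`; `other`
    holds OOM lines about any other process.
    """
    named, other = [], []
    for line in lines:
        if not OOM_RE.search(line):
            continue
        text = line.rstrip("\n")
        if any(name in text for name in process_names):
            named.append(text)
        else:
            other.append(text)
    return named, other
```

It could be run over the system log, for example
find_oom_lines(open("/var/log/messages", errors="replace")), where the
path is an assumption about where this system's syslog writes kernel
messages. An empty result does not rule out an OOM kill if the log has
been rotated in the meantime.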

Could you try strace'ing the postmaster process to see what it's doing
when this is happening?
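
For reference, the postmaster's PID is the first line of postmaster.pid
in the data directory, and strace can attach to a running process with
-p. A small Python sketch that builds the command; the helper names and
the output file name are illustrative, and the data directory has to be
supplied by the caller:

```python
from pathlib import Path

def postmaster_pid(data_dir):
    """Read the postmaster PID from the first line of <data_dir>/postmaster.pid."""
    first_line = Path(data_dir, "postmaster.pid").read_text().splitlines()[0]
    return int(first_line.strip())

def strace_command(pid, out_file="postmaster.strace"):
    """Build an strace invocation that attaches to `pid`.

    -tt adds wall-clock timestamps, -p attaches to a running process,
    -o writes the trace to a file instead of stderr.
    """
    return ["strace", "-tt", "-p", str(pid), "-o", out_file]
```

The resulting list can be passed to subprocess.run() by a user allowed
to trace the process (root or the postgres account); interrupting strace
detaches it and leaves the postmaster running. On a yum install the data
directory is probably /var/lib/pgsql/data, but that is an assumption
worth verifying first.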

			regards, tom lane

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)