Re: Autovacuum seems to block database: WARNING worker took too long to start

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Tom, thanks for your detailed responses. I really appreciate it.
My comments below inline ...


> I'm having a strange problem with postgres & autovacuum
> Everything is working fine until I start getting the following errors ...
> and postgres stops working shortly after (it stops accepting connections)

> 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08
> CET|1/44303|0|| LOG:  automatic vacuum of table
> "mrs.pg_catalog.pg_statistic": index scans: 1
>         pages: 0 removed, 189 remain
>         tuples: 132 removed, 4587 remain
>         system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled
> 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> CET||0|| WARNING:  worker took too long to start; cancelled

Hm.  The code comment above that warning says

            * The only problems that may cause this code to
            * fire are errors in the earlier sections of AutoVacWorkerMain,
            * before the worker removes the WorkerInfo from the
            * startingWorker pointer.

but it's hard to see what problem there could lead to an issue.  (In
particular, I discount the idea that AutovacuumLock could be stuck,
because we had to acquire it in order to issue this message.)

But it strikes me that the code comment is wrong in one significant way:
if the postmaster were failing to heed SIGUSR1 at all, you could reach
the timeout here, because the fork-failed signal wouldn't get sent.

Given that you say it also stops accepting connections, I'm thinking
this is a postmaster problem not an autovacuum problem.  But you've
not provided any information about that end of it.  Exactly what happens
when you try to make a connection?  Are there any entries at all in the
postmaster log?  What about the kernel log?  Are you sure that new
connections stop working *after* this happens, and not at exactly the
same time?

When I try to make a connection from the command line I get nothing. I mean, if I execute ...

# psql -U postgres -l

I have to Ctr-C or Ctr-Z (followed by kill %1) because I can wait forever.
It's like no more connections were accepted. Obviously our applications stop working immediately. 
Trying to stop postgres with

# /etc/init.d/postgres stop
Stopping postgresql service:  

I have to strop it with Ctr-C as it does not wotk (I waited up to 3 min. with no luck). As a consequence I had to kill the postmaster process manually.

Kernel log says nothing :(

Not sure what started failing first: the connections or the autovacuum. I just went to the log and saw that sequence of events. Searching in google I found a few references saying autovacuum and that error log "WARNING:  worker took too long to start; cancelled" were releated.
 

> OS: Centos 5.5
> Kernel: 2.6.18-194.26.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)

Given that this is a Linux system, I think that an OOM kill on the
postmaster is a not-insignificant possibility.  Or at least I would
think that if there weren't a PostmasterIsAlive check in the autovac
launcher loop.  It's real hard to see how you could get more than
one of these messages if the postmaster were gone entirely.

I just noticed the error log repeats once every minute but the content is the same ...
First time the log appeared

2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled

It repeats once every minute until we fix it

2010-11-14 09:44:01.889 CET|1208|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled

All 1,000+ lines are alike except for the date (every new line adds 1 more minute)
However the log is the same

2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING:  worker took too long to start; cancelled

According to our postgresql.conf log format ...

log_line_prefix = '%m|%l|%d|%r|%p|%i|%c|%s|%v|%x|%u| '            
                                        #   %u = user name
                                        #   %d = database name
                                        #   %r = remote host and port
                                        #   %h = remote host
                                        #   %p = process ID
                                        #   %t = timestamp without milliseconds
                                        #   %m = timestamp with milliseconds
                                        #   %i = command tag
                                        #   %c = session ID
                                        #   %l = session line number
                                        #   %s = session start timestamp
                                        #   %v = virtual transaction ID
                                        #   %x = transaction ID (0 if none)
                                        #   %q = stop here in non-session
                                        #        processes

Splitting the log ... 

(%m = timestamp with milliseconds) = 2010-11-14 09:45:03.105 CET
(%l = session line number) = 209
(%d = database name) = 
(%r = remote host and port) = 
(%p = process ID) = 3300
(%i = command tag) = 
(%c = session ID) = 4cdc2ae6.ce4
(%s = session start timestamp) = 2010-11-11 18:41:58 CET
(%v = virtual transaction ID) = 
(%x = transaction ID (0 if none)) = 0
%u = user name) = 
Message = WARNING:  worker took too long to start; cancelled

I see the following fields are the same in the 1,200+ lines of the log

(%d = database name) = 
(%r = remote host and port) = 
(%c = session ID) 
(%s = session start timestamp) 
(%v = virtual transaction ID) 
(%x = transaction ID (0 if none)) 
%u = user name) 
Message
 

Does this info help?
What does "empty database" mean? could be the autovacuum process?
I noticed the "%s" (
2010-11-11 18:41:58 CET) 
matches the time we started postgres. Here you have the log

2010-11-11 18:41:58.840 CET|1|||3268||4cdc2ae6.cc4|2010-11-11 18:41:58 CET||0|| LOG:  database system was shut down at 2010-11-11 18:38:55 CET
2010-11-11 18:41:58.902 CET|1|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| LOG:  autovacuum launcher started
2010-11-11 18:41:58.902 CET|1|||3234||4cdc2ae4.ca2|2010-11-11 18:41:56 CET||0|| LOG:  database system is ready to accept connections



Could you try strace'ing the postmaster process to see what it's doing
when this is happening?

I definitely will.

More information in case that helps. Strangely, we had the same problem in a temporary database we used last week while migrating some data. This time there was not autovacuum entry in the log ...

2010-11-10 20:26:06.441 CET|1|cache|10.19.0.48(20840)|11239|SELECT|4cdaf1cc.2be7|2010-11-10 20:26:04 CET|114/0|0|postgres| LOG:  duration: 1770.500 ms  statement: select id_type,id,id_language_fk,data from cache where id_type=5  and id = '60820269' and id_language_fk = 2
2010-11-10 20:26:07.100 CET|1|xxx|10.19.0.43(14080)|9423|SELECT|4cdaef43.24cf|2010-11-10 20:15:15 CET|80/0|0|postgres| LOG:  duration: 1178.423 ms  statement: SELECT "reviews".id FROM "reviews" WHERE ("reviews"."booking_id" = E'10101690551')  LIMIT 1
2010-11-10 20:27:26.905 CET|3|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:28:27.422 CET|4|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:28:41.641 CET|1|mrs|10.19.0.51(20106)|11272|SELECT|4cdaf1d0.2c08|2010-11-10 20:26:08 CET|6/701|0|postgres| LOG:  duration: 1878.724 ms  execute <unnamed>: select tp.*,t.id_token,uo.* from tokenprovider as tp inner join token as t on (tp.id_tokenprovider = t.id_tokenprovider_fk) inner join tkuo as tu on (t.id_token = tu.id_token_fk) inner join useoccupation as uo on (uo.id_useoccupation = tu.id_useoccupation_fk and uo.id_tokenprovider_fk = tp.id_tokenprovider) where tp.id_tokenprovider in (10203746,10369604,10718768,10001514,10170665,10091612,10829359,10797243,50013934,12536008,90529952,10064394,10769848,50013983,10012115,23951091,25688092,60712161,10455933,10766831,62781261,10560488,10808512,80751062,38735328,50013938,10481737,50013710,10916713,10556915,10999695,75417634,22313851,55471672,88737419,46563417,69789399,20418193,37363697,55232743,10328868,10171688,77959221,10380595,28839589,10439708,10165273,10872146,50356324,15842616,10597119,10943311,43894782,61323020) order by tp.id_tokenprovider,t.id_token,uo.sort
2010-11-10 20:29:28.349 CET|5|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:30:01.940 CET|1|xmas|10.19.0.43(14078)|9422|idle|4cdaef43.24ce|2010-11-10 20:15:15 CET|70/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.942 CET|1|xxxx|10.19.0.43(2900)|5793|idle|4cdaebb6.16a1|2010-11-10 20:00:06 CET|60/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.948 CET|1|mrs|10.19.0.43(14081)|9424|idle|4cdaef43.24d0|2010-11-10 20:15:15 CET|89/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.950 CET|3|xxxx|10.19.0.43(14429)|9840|idle|4cdaefc8.2670|2010-11-10 20:17:28 CET|72/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.951 CET|1|xxxx|10.19.0.43(60284)|10509|idle|4cdaf0ba.290d|2010-11-10 20:21:30 CET|83/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.961 CET|1|xxxx|10.19.0.43(49864)|10986|idle|4cdaf19f.2aea|2010-11-10 20:25:19 CET|57/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:01.969 CET|1|xxxx|10.19.0.43(50039)|11282|idle|4cdaf1d9.2c12|2010-11-10 20:26:17 CET|65/0|0|postgres| LOG:  unexpected EOF on client connection
2010-11-10 20:30:28.875 CET|6|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:31:29.210 CET|7|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:32:29.642 CET|8|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:33:30.271 CET|9|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:34:30.604 CET|10|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled
2010-11-10 20:35:31.106 CET|11|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING:  worker took too long to start; cancelled

Well, thanks for any information. I can provide more logs/information if necessary.

Pablo
 

                       regards, tom lane


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux