Re: Invulnerable VACUUM process thrashing everything

"Jeffrey W. Baker" <jwbaker@xxxxxxx> · Thu, 29 Dec 2005 18:16:02 -0800

On Thu, 2005-12-29 at 22:53 +0000, Russ Garrett wrote:
> In my experience a kill -9 has never resulted in any data loss in this 
> situation (it will cause postgres to detect that the process died, shut 
> down, then recover), and most of the time it only causes a 5-10sec 
> outage. I'd definitely hesitate to recommend it in a production context 
> though, especially since I think there are some known race-condition 
> bugs in 7.4.
> 
> VACUUM *will* respond to a SIGTERM, but it doesn't check very often - 
> I've often had to wait hours for it to determine that it's been killed, 
> and my tables aren't anywhere near 1TB. Maybe this is a place where 
> things could be improved...

FWIW, I murdered this process with SIGKILL, and the recovery was very
short.

> Incidentally, I have to kill -9 some of our MySQL instances quite 
> regularly because they do odd things. Not something you want to be 
> doing, especially when MySQL takes 30mins to recover.

Agreed.  After mysql shutdown with MyISAM, all tables must be checked
and usually many need to be repaired.  This takes a reallllllly long
time.

-jwb

> Russ Garrett
> Last.fm Ltd.
> russ@xxxxxxx
> 
> Ron wrote:
> 
> > Ick.  Can you get users and foreign connections off that machine, lock 
> > them out for some period, and renice the VACUUM?
> >
> > Shedding load and keeping it off while VACUUM runs high priority might 
> > allow it to finish in a reasonable amount of time.
> > Or
> > Shedding load and dropping the VACUUM priority might allow a kill 
> > signal to get through.
> >
> > Hope this helps,
> > Ron
> >
> >
> > At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote:
> >
> >> A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
> >> decided to VACUUM a table which has not been updated in over a year and
> >> is more than one terabyte on the disk.  Because of the very high
> >> transaction load on this database, this VACUUM has been ruining
> >> performance for almost a month.  Unfortunately is seems invulnerable to
> >> killing by signals:
> >>
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >> # kill -HUP 15308
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >> # kill -INT 15308
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >> # kill -PIPE 15308
> >> # ps ax | grep VACUUM
> >> 15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
> >>
> >> o/~ But the cat came back, the very next day ...
> >>
> >> I assume that if I kill this with SIGKILL, that will bring down every
> >> other postgres process, so that should be avoided.  But surely there is
> >> a way to interrupt this.  If I had some reason to shut down the
> >> instance, I'd be screwed, it seems.
> >
> >
> >
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: Don't 'kill -9' the postmaster
> >
>