Steinar H. Gunderson wrote:
> On Wed, Jul 12, 2006 at 08:43:18AM -0700, Craig A. James wrote:
>>> Then you killed the wrong backend...
>>> No queries run in postmaster. They all run in postgres backends. The
>>> postmaster does very little actual work, other than keeping track of
>>> everybody else.
>> It turns out I was confused by this: ps(1) reports a process called
>> "postgres", but top(1) reports a process called "postmaster", but they both
>> have the same pid. I guess postmaster replaces its own name in the process
>> table when it's executing a query, and it's not really the postmaster even
>> though top(1) calls it postmaster.
>> So "kill -15 <pid>" is NOT killing the process -- to kill the process, I
>> have to use signal 9. But if I do that, ALL queries in progress are
>> aborted. I might as well shut down and restart the database, which is an
>> unacceptable solution for a web site.
> I don't follow your logic here. If you do "kill -15 <pid>" of the postmaster
> doing the work, the query should be aborted without taking down the entire
> cluster. I don't see why you'd need -9 (which is a really bad idea anyhow)...

I've solved this mystery. "kill -15" doesn't immediately kill the job -- it aborts the query, but it might take 15-30 seconds to clean up.
This confused me, because the query I was using to test took about 30 seconds, so the SIGTERM didn't seem to make a difference. But when I used a harder query, one that would run for 5-10 minutes, SIGTERM still stopped it after 15 seconds, which isn't great but it's acceptable.
Bottom line is that I was expecting "instant death" with SIGTERM, but instead got an agonizing, drawn-out -- but safe -- death of the query. At least that's my deduction based on experiments; I haven't dug into the source to confirm.
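
For anyone building a similar feature, the "send SIGTERM, then give it time to clean up" pattern can be sketched roughly as below. This is only an illustration under assumptions, not anything taken from PostgreSQL: the helper names pid_alive and terminate_and_wait, the 60-second timeout, and the polling interval are all made up for the example. It deliberately never escalates to signal 9.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this pid still exists.

    Signal 0 performs only the existence/permission check; nothing is
    delivered. Note that an unreaped (zombie) child still counts as alive.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_and_wait(pid, timeout=60.0, poll_interval=0.5, alive=pid_alive):
    """Send SIGTERM to pid, then poll until it is gone or timeout elapses.

    Returns True if the process exited within the timeout, False otherwise.
    The timeout and poll interval are hypothetical defaults; SIGKILL is
    never sent.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not alive(pid):
            return True
        time.sleep(poll_interval)
    return not alive(pid)
```

A caller would invoke terminate_and_wait(backend_pid) and treat a False result as "still cleaning up, check again later" rather than as a reason to reach for kill -9. The alive parameter exists so that code which started the process itself can substitute its own check, since a child that has exited but not yet been reaped still answers to signal 0.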
Thanks everyone for your answers. My "kill this query" feature is now acceptable.
Craig