Will Storey <will@xxxxxxxxxxxxx> writes:
> Thanks for the pointer!
> It looks like we'd have to be entering that section and finding the
> statement timeout not set, otherwise we'd not fall through to the "user
> request" case.
> But as far as I can tell, the only reason my query would be cancelled is
> because of the statement timeout. I wonder if there is some way the
> statement timeout flag is being lost/cleared. I see there's situations
> where we clear timeout flags.

I poked at this for a while and concluded that what you must be seeing
is that the statement timeout interrupt triggers, but no
CHECK_FOR_INTERRUPTS call happens thereafter, until we get to the
disable_statement_timeout() call in finish_xact_command().  That throws
away the timeout-happened indicator, but it does nothing to reset
InterruptPending or QueryCancelPending.  Those flags will also be
thrown away eventually, after we've read a new client command:

        /*
         * (4) disable async signal conditions again.
         *
         * Query cancel is supposed to be a no-op when there is no query in
         * progress, so if a query cancel arrived while we were idle, just
         * reset QueryCancelPending. ProcessInterrupts() has that effect when
         * it's called when DoingCommandRead is set, so check for interrupts
         * before resetting DoingCommandRead.
         */
        CHECK_FOR_INTERRUPTS();
        DoingCommandRead = false;

Ignoring the timeout if it happened too late is what we're supposed to
be doing, so that's fine so far.  But: what happens if some other
CHECK_FOR_INTERRUPTS occurs before we reach the main input loop?
Of course, exactly the symptom you saw: a bogus "canceling statement
due to user request" error report.  What's worse, that happens after
we've completed the transaction, leaving the client with a very
misleading idea of whether the command committed or not.

So how could that happen?
After experimenting, I found a pretty obvious candidate: statement
duration logging happens after we do finish_xact_command, and there's
a CHECK_FOR_INTERRUPTS in errfinish().  So I hypothesize that you have
log_duration or log_min_duration_statement enabled.

I was able to replicate the problem by lobotomizing a couple of
CHECK_FOR_INTERRUPTS that would normally happen in a simple
"select pg_sleep()" query (see attached for-demonstration-purposes-only
patch).  Then I get

regression=# set statement_timeout TO '1s';
SET
regression=# select pg_sleep(2.0);  -- completes normally after 2 sec
 pg_sleep 
----------
 
(1 row)

regression=# set log_min_duration_statement TO '100ms';
SET
regression=# select pg_sleep(2.0);  -- sits for 2 sec, then:
ERROR:  canceling statement due to user request

So that's not very good and we ought to do something about it.
We can't have disable_statement_timeout() clear QueryCancelPending,
because that might lose state about some other cause for query cancel.
I'm inclined to think that we need to split QueryCancelPending into
more than one flag.

A separate question is how come the particular query you're complaining
about has (seemingly) a fairly wide window where it never does any
CHECK_FOR_INTERRUPTS call before terminating.  Perhaps there's
someplace we need to sprinkle a few more of those.

			regards, tom lane
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 80ed9cc..d09ac95 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -72,7 +72,7 @@ ExecResult(PlanState *pstate)
 	PlanState  *outerPlan;
 	ExprContext *econtext;
 
-	CHECK_FOR_INTERRUPTS();
+//	CHECK_FOR_INTERRUPTS();
 
 	econtext = node->ps.ps_ExprContext;
 
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d330a88..9f515db 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -369,7 +369,7 @@ pg_sleep(PG_FUNCTION_ARGS)
 		float8		delay;
 		long		delay_ms;
 
-		CHECK_FOR_INTERRUPTS();
+//		CHECK_FOR_INTERRUPTS();
 
 		delay = endtime - GetNowFloat();
 		if (delay >= 600.0)