On Tue, 8 Sep 2020 at 06:05, Raj <raj@xxxxxxxx> wrote:
>
> > This would not exactly look like a bug, because the message says "to
> > be locked", so at least it's not allowing two workers to lock the same
> > tuple. But it seems that the skip-locked mode should not make an error
> > out of this, but treat it as the tuple was already locked. Why would
> > it want to lock the tuple (representing the job) if another worker has
> > already finished his UPDATE of the job to mark it as "done" (which is
> > what makes the tuple move to the "completed" partition.)

(It's not very clear who wrote the above text, since the quote does not
name the author and the original email doesn't appear to have made it to
the list.)

It's not a bug. I think the quoted text is expecting a bit too much from
the database. It does not know that a tuple which has been updated and
moved to another partition can be safely ignored. For all the database
knows, the new version of the tuple sitting in the new partition still
matches the query's WHERE clause and should be locked. If we just went
and ignored moved-off tuples, we could miss processing tuples that still
need to be processed.

It's perhaps not impossible to make this work slightly better if there
were some way to inform heapam_tuple_lock() that it's operating on a
partition, that the query was against a partitioned table, and that all
but one partition was removed by partition pruning. In that case we
could be certain the new version of the tuple can't match the SELECT's
WHERE clause, since partition pruning already determined that no other
partition can match it. However, that's: a) a pretty horrid thing to
have to teach heapam_tuple_lock() about, and b) only going to work when
exactly one partition survives partition pruning, which is pretty
fragile, since an ATTACH PARTITION could suddenly cause your queries to
start failing randomly again.

If you had 3 partitions, one each for "pending", "retry" and "complete",
and you wanted to lock all rows that are in a "pending" or "retry"
state, then when we encounter an updated row in the "pending" partition,
we have no way of knowing whether it was moved into the "retry" or the
"complete" partition. If it's in "retry", then we do want to find it and
process it, but if it's in "complete", then it no longer matches the
query's WHERE clause and we could ignore it. Since we don't know which,
we can't make assumptions and must force the user to try again, hence
the serialisation failure error.

> > Either the SELECT for jobs to do returned a wrong tuple, which was
> > already updated, or there is some lapse in the locking.
> >
> > Either way it would seem to be a waste of time throwing all these
> > errors when the tuple should not even have been selected for update
> > and locking.
> >
> > I wonder if anybody knows anything about that issue? Of course you'll
> > want to see the DDL and SQL queries, etc. but you can't really try it
> > out unless you do some massively parallel magic.

I already mentioned why this cannot work that way [1]. If you have some
idea of how to make it work correctly, it would be interesting to hear
it. Otherwise, I'm sorry to say that we can't just ignore these tuples
because doing so happens to suit your use case. The solution is just to
make the application retry on serialisation failures.

David

[1] https://www.postgresql.org/message-id/CAApHDvrDH6TQeLxTqnnAnhjrs55ru5g2_QMG=ME+WvD5MmpHQg@xxxxxxxxxxxxxx
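
In case it helps, here is a minimal sketch of the sort of schema and
worker transaction being discussed. The table, column and partition
names are invented purely for illustration; the point is only that the
application should treat serialisation failures (SQLSTATE 40001) as a
signal to run the whole transaction again:

    -- Hypothetical job queue, list-partitioned on the job's status.
    CREATE TABLE jobs (
        job_id  bigint NOT NULL,
        status  text   NOT NULL,
        payload text
    ) PARTITION BY LIST (status);

    CREATE TABLE jobs_pending  PARTITION OF jobs FOR VALUES IN ('pending');
    CREATE TABLE jobs_retry    PARTITION OF jobs FOR VALUES IN ('retry');
    CREATE TABLE jobs_complete PARTITION OF jobs FOR VALUES IN ('complete');

    -- Worker transaction.  If another session's UPDATE moves a row
    -- we're about to lock into a different partition, the SELECT raises
    -- "tuple to be locked was already moved to another partition due to
    -- concurrent update" (a serialisation failure, SQLSTATE 40001), and
    -- the application should simply retry the whole transaction.
    BEGIN;

    SELECT job_id, payload
      FROM jobs
     WHERE status IN ('pending', 'retry')
     ORDER BY job_id
     LIMIT 1
       FOR UPDATE SKIP LOCKED;

    -- ... do the work for the returned job, then mark it done, which
    -- moves the row into the jobs_complete partition:
    UPDATE jobs SET status = 'complete' WHERE job_id = 42;  -- id from above

    COMMIT;

Rows that are merely locked by another worker are still skipped silently
by SKIP LOCKED; the serialisation failure only appears when a row was
concurrently updated in a way that moved it to another partition, and on
retry the query will either pick up the new version (if it still
matches) or not see it at all.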