Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Merging a couple of emails:

On 20-Aug.-2020 19:37, David Rowley wrote:

When updates occur in a non-partitioned table we can follow item
pointer chains to find the live row and check whether the WHERE clause
still matches, to determine if the row should be updated, or in this
case just locked since it's a SELECT FOR UPDATE. However, with a
partitioned table, a concurrent UPDATE may have caused the row to be
moved off to another partition, in which case the tuple's item pointer
cannot point to it since we don't have enough address space; we only
have 6 bytes for a TID. To get around the fact that we can't follow
these update chains, we just throw the serialization error, which is
what you're getting. Ideally, we'd figure out where the live version
of the tuple is, check whether it matches the WHERE clause, and lock it
if it does, but we have no means to do that with the current design.
This is the best description I have ever seen of what causes the "tuple to be locked was already moved to another partition due to concurrent update" message. Given your explanation, it makes complete sense why this happens. Thank you for the detail.

I am backing off with a time delay and then retrying the query, which seems to be the correct approach as well, but I only stumbled upon it by accident. Hopefully, when the Google gnomes index the mailing list, this message will come up and save others the time and worry.
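For anyone who finds this later, here is a minimal sketch of the pattern being discussed. The table and column names are placeholders I've invented, not the real schema, and the claim query is assumed rather than copied from my application:

    BEGIN;

    -- Claim a batch of work; rows locked by other consumers are skipped
    -- rather than waited on.
    SELECT queueid, payload
    FROM   queue
    WHERE  state = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT  250;

    -- ... mark the claimed rows and process them ...

    COMMIT;

    -- If the SELECT fails with "tuple to be locked was already moved to
    -- another partition due to concurrent update" (reported as a
    -- serialization failure), ROLLBACK, sleep a short randomised delay in
    -- the client, and retry the whole transaction.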

On 21-Aug.-2020 00:34, Thomas Munro wrote:

Maybe it's just getting algorithmically ugly.  To claim some job rows,
you have to skip all dead/non-matching tuples left behind so far at
the start of the table by all the other sessions, and then also all
currently locked tuples, and you have to do update-chain walks on some
of them too.  It all gets a bit explosive once you have such high
numbers of workers.
Yes, fundamentally it seems to come down to traffic volume.  When I get over 128 connections all selecting and locking that one table, applying the locks seems to struggle and the problem grows exponentially.  I'm an extreme edge case, so not really a scenario the locking could ever have been expected to handle, but it's pretty good that it gets up to 128 at all.
I think I'd experiment with splitting the job table up into N tables
and feed jobs into all of them about evenly (by hashing, at random,
whatever), and then I'd assign each consumer a "preferred" table where
it looks for jobs first (perhaps my_worker_id % ntables), before
trying the others in round robin order.  Then they won't trample on
each other's toes so much.
I think this is a good idea, but (in my case) I think this is where it will need v13, which is going to give that via "Allow ROW values to be used as partitioning expressions"? (e.g. will v13 permit queueid mod 200 as the partition expression, making 200 partitions to allow 200 contention-free consumers?).
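As a rough sketch of what I understand Thomas to mean (all names are placeholders, and I've used 8 partitions rather than 200 for brevity): plain hash partitioning on an integer queue id, which I believe has been available since v11, with each worker querying "its" partition first:

    CREATE TABLE queue (
        queueid  bigint NOT NULL,
        payload  jsonb,
        state    text   NOT NULL DEFAULT 'pending'
    ) PARTITION BY HASH (queueid);

    CREATE TABLE queue_p0 PARTITION OF queue
        FOR VALUES WITH (MODULUS 8, REMAINDER 0);
    CREATE TABLE queue_p1 PARTITION OF queue
        FOR VALUES WITH (MODULUS 8, REMAINDER 1);
    -- ... and so on up to queue_p7 ...

    -- Worker 3 looks in its preferred partition first ...
    SELECT queueid, payload
    FROM   queue_p3
    WHERE  state = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT  250;
    -- ... and only tries the other partitions, round robin, if this
    -- returns too few rows.

Thomas's suggestion was N separate tables rather than declarative partitions; either layout gives each consumer a smaller, mostly private area to scan.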
In the past I've wondered about a hypothetical circular_seqscan
option, which would cause table scans to start where they left off
last time in each backend, so SELECT * FROM t LIMIT 1 repeated would
show you a different row each time until you get all the way around to
the start again (as we're entirely within our rights to do for a query
with no ORDER BY).  That'd give the system a chance to vacuum and
start refilling the start of the table before you get around to it
again, instead of repeatedly having to step over the same useless
pages every time you need a new job.
I like the sound of this; if I understand correctly, it would essentially walk in insertion(-ish) order, which would be OK for me, as long as it was documented clearly. Perhaps I should put a page online about high-traffic message queues with PostgreSQL for people to find when they try the same thing.

Hmm.  I guess another way to avoid colliding with others' work would
be to try to use SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR
UPDATE SKIP LOCKED LIMIT ....  It's less cache-friendly, and less
order-preserving, but way more contention-friendly.  That has another
complication though; how do you pick 10?  And if it doesn't return any
or enough rows, it doesn't mean there isn't enough, so you may need to
be ready to fall back to the plain approach if having 250 rows is
really important to you and TABLESAMPLE doesn't give you enough.  Or
something.

For me, this is an acceptable compromise.  I just need to consume something on each pass, up to 250 items; getting fewer than 250 would still allow progress, and if it removed wait time it could even give greater overall throughput.  I may try this just to see how it performs.
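Something like the following is what I'd try, purely as a sketch using the same placeholder names as above; the 10 percent figure is a guess that would need tuning:

    -- First attempt: sample roughly 10% of the table's pages, so workers
    -- stop fighting over the same blocks at the head of the table.
    SELECT queueid, payload
    FROM   queue TABLESAMPLE SYSTEM (10)
    WHERE  state = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT  250;

    -- Getting fewer than 250 rows here does not prove the work is gone, so
    -- fall back to the plain scan if a full batch really matters:
    SELECT queueid, payload
    FROM   queue
    WHERE  state = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT  250;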

If someone has a strictly ordered queue, they will never really have this issue, as they must start at the beginning and go sequentially with only one consumer.  Because I only need loose/approximate ordering and throughput is the objective, all the locking and contention come into play.

By the way, when working with around 64 consumer processes I was also
annoyed by the thundering herd problem when using NOTIFY.  I found
various workaround solutions to that, but ultimately I think we need
more precise wakeups for that sort of thing, which I hope to revisit
one day.

I'm making a note of this because I also happen to have a different scenario with NOTIFY and well over 100 LISTEN consumers...  That hasn't given me problems - yet - but I'm now aware of it should problems arise.
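If it does become a problem, one workaround I've seen suggested (I don't know if it's what Thomas used) is to spread wakeups over per-worker channels so that a NOTIFY wakes one listener instead of all of them; the channel names here are made up:

    -- Producer: pick one worker channel, e.g. by hashing the job id.
    NOTIFY queue_worker_17, 'new job';

    -- Consumer 17 listens only on its own channel.
    LISTEN queue_worker_17;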






