On Tue, Aug 18, 2020 at 8:22 PM Jim Jarvie <jim@xxxxxxxxxxxxxx> wrote:
I've tuned the LIMIT value both up and down. As I move the limit up, the problem becomes substantially worse; 300 swamps it and the selects take > 1 hour to complete; at 600 they just all lock everything up and it stops processing. I did try 1,000 but it basically resulted in nothing being processed.
You've only described what happens when you turn the LIMIT up. What happens when you turn it down? Why did you pick 250 in the first place? I don't see the rationale for having 250*256 rows locked simultaneously. I can see reasons you might want a LIMIT as high as 250, or for having 256 processes. I just don't see why you would want to do both in the same system.
Less processes does not give the throughput required because the queue sends data elsewhre which has a long round trip time but does permit over 1K concurrent connections as their work-round for throughput. I'm stuck having to scale up my concurrent processes in order to compensate for the long processing time of an individual queue item.
You've tied the database concurrency to the external process concurrency. While this might be convenient, there is no reason to think it will be optimal. If you achieve concurrency by having 256 processes, why does each process need to lock 250 rows at time. Having 64,000 rows locked to obtain 256-fold concurrency seems like a poor design.
With modern tools it should not be too hard to have just one process obtain 1000 rows, and launch 1000 concurrent external tasks. Either with threads (making sure only one thread deals with the database), or with asynchronous operations. (Then the problem would be how to harvest the results, it couldn't unlock the rows until all external tasks have finished, which would be a problem if some took much longer than others).
It is easy to reproduce scaling problems when you have a large number of processes trying to do ORDER BY id LIMIT 250 FOR UPDATE SKIP LOCKED without all the partitioning and stuff. I don't know if the problems are as severe as you describe with your very elaborate setup--or even if they have the same bottleneck. But in the simple case, there seems to be a lot of spin-lock contention, as every selecting query needs to figure out if every marked-as-locked row is truly locked, by asking if the apparently-locking transaction is still valid.
Cheers,
Jeff