On Sat, Feb 23, 2019 at 4:06 PM Gunther <raj@xxxxxxxx> wrote:
Hi,
I am using an SQL queue for distributing work to massively parallel workers.
You should look into specialized queueing software.
...
I figured I might just pause all workers briefly to schedule the REINDEX Queue command, but the problem with this is that while the transaction volume is large, some jobs may take minutes to process, and in that case we need to wait minutes to quiet the database with then 47 workers sitting as idle capacity waiting for the 48th to finish so that the index can be rebuilt!
The jobs that take minutes are themselves the problem. They prevent tuples from being cleaned up, meaning all the other jobs needs to grovel through the detritus every time they need to claim a new row. If you got those long running jobs to end, you probably wouldn't even need to reindex--the problem would go away on its own as the dead-to-all tuples get cleaned up.
Locking a tuple and leaving the transaction open for minutes is going to cause no end of trouble on a highly active system. You should look at a three-state method where the tuple can be pending/claimed/finished, rather than pending/locked/finished. That way the process commits immediately after claiming the tuple, and then records the outcome in another transaction once it is done processing. You will need a way to detect processes that failed after claiming a row but before finishing, but implementing that is going to be easier than all of this re-indexing stuff you are trying to do now. You would claim the row by updating a field in it to have something distinctive about the process, like its hostname and pid, so you can figure out if it is still running when it comes time to clean up apparently forgotten entries.
Cheers,
Jeff