Hi,
I am using an SQL queue for distributing work to massively
parallel workers.
You should look into
specialized queueing software.
...
I figured I might just pause all workers briefly to schedule the
REINDEX Queue command, but the problem with this is that while the
transaction volume is large, some jobs may take minutes to
process, and in that case we need to wait minutes to quiet the
database with then 47 workers sitting as idle capacity waiting for
the 48th to finish so that the index can be rebuilt!
The
jobs that take minutes are themselves the problem. They prevent tuples
from being cleaned up, meaning all the other jobs needs to grovel
through the detritus every time they need to claim a new row. If you
got those long running jobs to end, you probably wouldn't even need to
reindex--the problem would go away on its own as the dead-to-all tuples
get cleaned up.
Locking a tuple and leaving the
transaction open for minutes is going to cause no end of trouble on a
highly active system. You should look at a three-state method where the
tuple can be pending/claimed/finished, rather than
pending/locked/finished. That way the process commits immediately after
claiming the tuple, and then records the outcome in another transaction
once it is done processing. You will need a way to detect processes
that failed after claiming a row but before finishing, but implementing
that is going to be easier than all of this re-indexing stuff you are
trying to do now. You would claim the row by updating a field in it to
have something distinctive about the process, like its hostname and pid,
so you can figure out if it is still running when it comes time to
clean up apparently forgotten entries.
Cheers,
Jeff