Using PostgreSQL 12 on Linux [Ubuntu 16.04 LTS]
I have a system which implements a message queue with the basic pattern that a process selects a batch of rows (for example, 250) for processing via SELECT ... LIMIT 250 FOR UPDATE SKIP LOCKED.
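For reference, the dequeue query is roughly of this shape (the table and column names below are placeholders, not the real schema):

    BEGIN;
    SELECT id
      FROM job_queue               -- placeholder queue table
     WHERE status = 'pending'      -- placeholder status column
     ORDER BY id
     LIMIT 250
       FOR UPDATE SKIP LOCKED;
    -- the worker then marks its rows as in progress and processes them
    COMMIT;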
When there are a small number of concurrent connections processing the queue, this works as expected and connections quickly obtain unique blocks of 250 rows for processing. However, as I scale up the number of concurrent connections, I see a spike in CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes, and the SELECT processes then wait for multiple minutes (10-20 minutes) before completing. My use case requires around 256 concurrent processors for the queue, but I've been unable to scale beyond 128 without everything grinding to a halt.
The queue table itself fits in RAM (with 2 MB huge pages), and during the wait all the performance counters drop to almost zero: no disk reads or writes (semi-expected given that the table fits in memory), a 100% buffer hit rate in pg_top, and rows read at around 100/s, which is much lower than expected.
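For context, huge pages are set up in the usual way, along these lines (the page count here is illustrative rather than our actual value):

    # /etc/sysctl.conf - reserve 2 MB huge pages for shared buffers
    vm.nr_hugepages = 20000

    # postgresql.conf
    huge_pages = on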
After processes complete the SELECT and the number of waiting SELECTs starts to fall, CPU load drops, and then suddenly the remaining processes all complete within a few seconds and things perform normally until the next time a group of SELECT FOR UPDATE statements bunch together, at which point the pattern repeats.
I found that performing an extremely frequent VACUUM ANALYZE (every 30 minutes) helps a little, but the problem remains very apparent.
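For clarity, that maintenance job simply runs the following against the queue table (again, the table name is a placeholder):

    VACUUM ANALYZE job_queue;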
I've exhausted all the performance-tuning and analysis advice I can find that seems even slightly relevant, but cannot get this cracked.
Is anyone on the list able to suggest how I can track down why this CPU hogging happens, as it does seem to be the root of the problem?
Thanks in advance,
Jim