Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>> (gdb) bt
>> #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>> #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>> #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>> #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>> #4  0x000055f60d36c252 in PGSemaphoreLock ()
>> #5  0x000055f60d3e35cc in LWLockAcquire ()
>> #6  0x000055f60d3d14ee in ProcArrayAdd ()
>> #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
>> #8  0x000055f60d51c99b in InitPostgres ()
>> #9  0x000055f60d3f2baf in PostgresMain ()
>> #10 0x000055f60d102bf8 in ?? ()
>> #11 0x000055f60d37e492 in PostmasterMain ()
>> #12 0x000055f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock.  It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray.  What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time.  We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?

A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

I can try turning off replication the next time I see the problem.

The workers basically loop on the following (a rough SQL sketch of one
iteration is in the P.S. below):

 * SELECT a job from a job-table, UPDATE it
 * Take an advisory lock on the relevant row in the calculations table
 * Do some work, at least a couple of SELECTs from various tables
 * UPDATE or INSERT a row into the calculations table, release the lock
 * INSERT a row in an exitstatus table
 * DELETE the job

The calculations can take anywhere from sub-second to several minutes,
with the majority at the shorter end. While this is happening there are
millions of jobs in the job-table.

When the problem occurs, the worker logfiles indicate that they are
moving along, still working.

To give a sense of scale, here is a plot of the number of transactions
per second:

 · https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most
of the workers.)


  Best regards,

    Adam

-- 
 "Examination and mastering of a new highly                  Adam Sjøgren
  intellectual equipment was a hard labour."            asjo@xxxxxxxxxxxx
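P.S.: To make the worker loop a bit more concrete, here is a rough
sketch of what one iteration does, written as illustrative SQL. The
table and column names (jobs, calculations, exitstatus, id,
calculation_id, claimed, result, status) and the :placeholders are
simplified stand-ins for the real schema and bind parameters, and the
ON CONFLICT form is just one way to express the "UPDATE or INSERT"
step:

  -- claim a job from the job-table and mark it as taken (simplified)
  BEGIN;
  SELECT id, calculation_id FROM jobs WHERE NOT claimed LIMIT 1;
  UPDATE jobs SET claimed = true WHERE id = :job_id;
  COMMIT;

  -- serialize work on the relevant calculations row via an advisory lock
  SELECT pg_advisory_lock(:calculation_id);

  -- the actual work: at least a couple of SELECTs from various tables,
  -- then write the result into the calculations table
  INSERT INTO calculations (id, result)
       VALUES (:calculation_id, :result)
  ON CONFLICT (id) DO UPDATE SET result = EXCLUDED.result;

  SELECT pg_advisory_unlock(:calculation_id);

  -- record the outcome and remove the finished job
  INSERT INTO exitstatus (job_id, status) VALUES (:job_id, 'ok');
  DELETE FROM jobs WHERE id = :job_id;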
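P.P.S.: The next time it happens I can also capture which backends are
stuck on lightweight locks such as the ProcArray lock with a query
along these lines - this assumes PostgreSQL 10 or newer (for the
wait_event and backend_type columns) and is not the exact query behind
the file linked above:

  SELECT pid, backend_type, state, wait_event_type, wait_event,
         left(query, 60) AS query
    FROM pg_stat_activity
   WHERE wait_event_type = 'LWLock';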