Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

Tom Lane <tgl@xxxxxxxxxxxxx> · Thu, 20 Aug 2020 12:43:47 -0400

=?utf-8?Q?Adam_Sj=C3=B8gren?= <asjo@xxxxxxxxxxxx> writes:
> Tom writes:
>> ... which implies that the problem is unexpectedly high contention for the
>> ProcArrayLock.

> One thing I should have mentioned, but forgot, is that the database is
> configured to do logical replication to another machine - could that
> have an effect on the lock contention?
> A colleague pointed out that in the pg_locks output, the replication
> processes are waiting on the ProcArray lock:
>  · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

Yeah, that is *mighty* interesting.  For the archives' sake, I'll point
out that this shows 16 walsender processes, of which 6 are in
WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
are waiting for the ProcArrayLock.

There are also three backends in ProcArrayGroupUpdate state, confirming
that we have a lot of contention for the ProcArrayLock ... but
interestingly, no non-walsender processes appear to be blocked directly
on the ProcArrayLock.  In particular, I *don't* see a backend that could
plausibly be the leader of that group commit; the other sessions showing
non-SELECT activity are waiting for assorted heavyweight locks, and they
all look like they're still running their transactions, not trying to
commit.  Perhaps this just reflects the fact that pg_stat_activity isn't
an instantaneous snapshot, though?

It's also evident that there's some parallel query activity going on,
as some sessions are in BgWorkerShutdown waits; so parallel worker
entry/exit might be contributing to the overall load on ProcArrayLock.

Anyway, we still don't know more than that the problem is ProcArrayLock
contention; it seems that the walsenders are involved, but we don't
really know if they're causing the problem or are just more victims.

One thing I'm curious about is whether you have walsenders starting
or stopping on a regular basis?  The ones visible here seem to all
have been running for at least an hour or so, but in a quick look
through the code it seemed like much of the walsender code that
would have any reason to take the ProcArrayLock is in walsender
startup or shutdown.

Another angle that is worth a try is to see whether you can reduce
the number of walsenders, and then see if that ameliorates the issue.
I'm not totally sure but I think we support cascaded replication,
so that you could reduce the replication load on the primary just
by reconfiguring some of the standbys to pull from other ones
not directly from the primary.

			regards, tom lane