On 9/30/20 1:22 PM, Adam Sjøgren wrote:
Tom writes:
=?utf-8?Q?Adam_Sj=C3=B8gren?= <asjo@xxxxxxxxxxxx> writes:
Tom writes:
... which implies that the problem is unexpectedly high contention for the
ProcArrayLock.
One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?
A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:
· https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
Yeah, that is *mighty* interesting. For the archives' sake, I'll point
out that this shows 16 walsender processes, of which 6 are in
WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
are waiting for the ProcArrayLock.
Sorry for not returning on this before now.
I have seen the slow login problem - ProcArrayLock contention - happen
with replication turned off, so I think that can be ruled out as the
cause.
I just observed something interesting right now, however.
Just to recap the scene: the basic setup is that we have a table with a
queue of jobs to be processed, and we have a number of programs (say 10)
on 6 other machines grabbing those jobs, doing calculations and
reporting back.
I don't have an answer. Not even sure if this is relevant to the
problem, but how are the jobs getting into the queue?
The number of jobs at any given time can fluctuate from none to
millions. Typically millions of jobs take some days to a week to get
through.
Now, I happened to have a psql prompt open when the problem of new
logins being slow appeared - and I ran a query that counted the number
of jobs by job-type.
Before the slow login problem, that query took around 70-80 ms, but when
the slow login appeared, it took 20+ seconds to run the query.
EXPLAIN ANALYZE showed that it was running a parallel query.
So I tried turning parallel queries off - with SET
max_parallel_workers_per_gather = 0; - and now the query took around 150
ms consistenly, despite logins still being (variably) slow.
So that confirms the ProcArrayLock contention, I think.
My next thought is: I have 6*10 programs making connections and doing
parallel queries over the job queue, to get new jobs. If the jobs happen
to be very "easy", those programs will be doing a lot of queries for
jobs. And when the queue is large enough (I'm guessing), those queries
will be parallel, so they will be creating new processes that need to be
added and removed from the ProcArray.
So am I shooting myself in the foot here, by creating lock contention
when trying to get jobs quickly from a largeish table?
I think my next step will be to modify the programs that grab the jobs
to avoid parallel queries when looking up the next job to run.
You anticipated this, it just took me until now to get a little closer:
It's also evident that there's some parallel query activity going on,
as some sessions are in BgWorkerShutdown waits; so parallel worker
entry/exit might be contributing to the overall load on ProcArrayLock.
Best regards,
Adam
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx