Re: weird issue with occasional stuck queries

Adam Scott <adam.c.scott@xxxxxxxxx> · Sat, 2 Apr 2022 12:09:15 -0700

The logs were helpful.  You may want to look at the log lines surrounding each error, since they often carry more detail, such as the SQL statement that triggered it.

Deadlocks are a sign that the client code needs to be examined for improvement.  See https://www.cybertec-postgresql.com/en/postgresql-understanding-deadlocks/ for background on deadlocks.  They slow things down and can cause SQL statements to queue up, eventually bogging down the system.
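
One common way to avoid deadlocks in client code is to make every transaction take its locks in the same order. Here is a minimal sketch of that idea in Python, using thread locks as stand-ins for row locks. The names (`row_a`, `row_b`, `transfer`, `run_demo`) are made up for illustration and have nothing to do with your schema.

```python
import threading

# Hypothetical stand-ins for two rows that concurrent transactions both update.
locks = {"row_a": threading.Lock(), "row_b": threading.Lock()}
balances = {"row_a": 100, "row_b": 100}


def transfer(src, dst, amount):
    """Move amount from src to dst, always locking in sorted key order."""
    if src == dst:
        return  # nothing to do, and locking the same key twice would hang
    # Sorting the keys means both directions lock row_a before row_b,
    # so two opposing transfers can never wait on each other in a cycle.
    first, second = sorted((src, dst))
    with locks[first]:
        with locks[second]:
            balances[src] -= amount
            balances[dst] += amount


def run_demo(rounds=1000):
    """Run opposing transfers concurrently and return the final balances."""
    def worker(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    t1 = threading.Thread(target=worker, args=("row_a", "row_b"))
    t2 = threading.Thread(target=worker, args=("row_b", "row_a"))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return dict(balances)
```

In SQL terms the equivalent would be to have every transaction touch rows in a consistent order (for example, sorted by primary key) rather than in whatever order the application happens to produce them.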

It definitely looks like a locking issue, which is why you don't see high CPU.  IIRC you might instead see high system CPU usage (as opposed to userspace CPU) if the kernel is getting overloaded.  The `top` command will help to show that.
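
If you would rather log the user/system split over time than watch `top`, something like the sketch below could work on Linux. It assumes the aggregate `cpu` line of `/proc/stat` with the usual field order (user, nice, system, idle, iowait, irq, softirq, steal). The function names are my own, and you would need to take two samples a few seconds apart yourself.

```python
# Field order of the aggregate "cpu" line in /proc/stat on Linux (assumed).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Pad with zeros in case the kernel reports fewer fields.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Percentage of CPU time per category between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * delta[k] / total for k in FIELDS}
```

A high `system` share next to a low `user` share is roughly what `top` reports as `sy` versus `us`, and `iowait` corresponds to `wa`.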

The disks could be saturated by the write-ahead log (WAL) handling of all the transactions.  More about WAL here: https://www.postgresql.org/docs/10/wal-internals.html  You could consider moving the WAL directory somewhere else, such as another disk, using a symbolic link (cf. the link).
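
As a rough sketch of the symlink approach described on that page: with the server shut down, move `pg_wal` out of the data directory and leave a symbolic link in its place. The helper below is only an illustration. `relocate_wal` and its arguments are names I made up, and it does not stop the server or fix ownership for you (a move across filesystems may not preserve the owner), so check both by hand.

```python
import os
import shutil


def relocate_wal(data_dir, new_parent):
    """Move data_dir/pg_wal under new_parent and symlink it back.

    Assumes the PostgreSQL server is already stopped and that the
    directory is named pg_wal (PostgreSQL 10 and later).
    """
    old_path = os.path.join(os.path.abspath(data_dir), "pg_wal")
    new_path = os.path.join(os.path.abspath(new_parent), "pg_wal")

    if os.path.islink(old_path):
        raise RuntimeError("pg_wal is already a symbolic link")
    if not os.path.isdir(old_path):
        raise FileNotFoundError(old_path)
    if os.path.exists(new_path):
        raise FileExistsError(new_path)

    # Move the directory, then point the original location at the new one.
    shutil.move(old_path, new_path)
    os.symlink(new_path, old_path, target_is_directory=True)
    return new_path
```

Try it on a scratch copy first, and verify that the link resolves and the files are owned by the postgres user before starting the server again.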

Anyway, these are the things I would look at.

Adam

On Sat, Apr 2, 2022 at 5:23 AM spiral <spiral@xxxxxxxxx> wrote:
Hey,

> That wait event according to documentation is "Waiting to access the
> multixact member SLRU cache."  SLRU = simple least recently used
> cache

I see, thanks!

> if you are low on memory, it can slow down the allocation of
> buffers. Do you have a query that is a "select for update" running
> somewhere? If your disk is low on space `df -h` that might explain
> the issue.

- There aren't any queries that are running for longer than the selects
shown earlier; definitely not "select for update" since I don't ever
use that in my code.
- Both disk and RAM utilization is relatively low.

> Is there an ERROR: multixact something in your postgres log?

There isn't, but while checking I saw some other concerning errors
including "deadlock detected", "could not map dynamic shared memory
segment" and "could not attach to dynamic shared area".
(full logs here: https://paste.sr.ht/blob/9ced99b119c3fce1ecfd71e8554946e7845a44dd )

> Another thing to look at is `iostat -x -y` and look at disk util %.
> This is an indicator, but not definitive, of how much disk access is
> going on.  It may be your drives are just saturated although your
> IOWait looks ok in your attachment.

I didn't specifically look at that, but I did notice *very* high disk
utilization in at least one instance of the stuck queries, as I
mentioned previously. Why would the disks be getting saturated? The
query count isn't noticeably higher than average, and the database
is not autovacuuming, so not sure what could cause that.

spiral