Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

Maxim Boguk <maxim.boguk@xxxxxxxxx> · Thu, 5 Jan 2023 12:57:00 +0200

On Sat, Dec 31, 2022 at 2:26 PM Maxim Boguk <maxim.boguk@xxxxxxxxx> wrote:
Hi,

When performing post-mortem analysis of some short latency spikes on a heavily loaded database, I found that the reason for (less than 10 second latency spike) wasn't on the EXECUTE stage but on the BIND stage.
At the same time graphical monitoring shows that during this few second period there were some queries waiting in the BIND stage.

Logging setup:
log_min_duration_statement=200ms
log_lock_waits=on
deadlock_timeout=100ms
So I expected that every lock waiting over 100ms (>deadlock_timeout) should be in the log.
But in the log I see only spikes on slow BIND but not lock waits logged.
(
grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration' | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
...
      9 2022-12-29 00:12:5
      2 2022-12-29 00:13:1
      3 2022-12-29 00:13:5
!!!  68 2022-12-29 00:14:0
      5 2022-12-29 00:14:1
      3 2022-12-29 00:14:2
      2 2022-12-29 00:14:3
).
But no lock waits on the BIND stage logged during the problem period (and no lock waits in general).
Similar issues happen a few times per day without any visible pattern (but on the same tables usually).
No CPU or IO load/latency spikes found during problem periods.
No EXECUTE slowdown found in the log during that time.

Followup research of this issue lead me to following results:
Every logged spike of BIND/PARSE response time correlated with corresponding backend waiting on
wait_event_type = LWLock
wait_event = pg_stat_statements
and all of these spikes happen during increment of pg_stat_statements_info.dealloc counter.

Some searching about this issue lead me to following blog post about similar issue:
https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html

However, we already have pg_stat_statements.max=10000 so further increase of this parameter
seems counterproductive (the size of 14/main/pg_stat_tmp/pgss_query_texts.stat is already over 20MB).

Open questions remains:
1)Is it expected behaviour of pg_stat_statements to block every BIND/PARSE during deallocation of least used entries for the whole period of cleanup?

2)Any recommended workaround for this issue for systems with strict latency SLA
(block every database query (used extended query protocol) for 200-500ms 50+ times per day at random time - isn't acceptable for our case unfortunately)?

3)Why only BIND/PARSE locks but not EXECUTE?
(may be some difference in implementation of plan vs exec pg_stat_statements counters?).

Kind Regards,
Maxim

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678