Re: How to analyze a short but heavy intermittent slowdown of BIND on a production database (or BIND vs log_lock_waits)

On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> Hi,
>
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the cause of a (less than 10 second)
> latency spike wasn't in the EXECUTE stage but in the BIND stage.
> At the same time, graphical monitoring shows that during this few-second
> period there were some queries waiting in the BIND stage.
>
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock wait over 100ms (>deadlock_timeout) would
> show up in the log.
> But in the log I see only the spikes of slow BINDs, with no lock waits logged.

What version postgres?  What settings have non-default values ?
What OS/version?  What environment/hardware?  VM/image/provider/...
What are the queries that are running BIND ?  What parameter types ?
Are the slow BINDs failing?  Are their parameters being logged?
What else is running besides postgres ?  Are the DB clients local or
remote ?  It shouldn't matter, but what client library?

What version of postgres? - 14.6

What settings have non-default values ? - a lot (it's a 48-core Amazon EC2 server with 396GB of RAM),
i.e. the database is carefully tuned for this particular workload.
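
For reference, the changed settings can be listed with something like this (just a sketch against pg_settings; filter to taste):

    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE source NOT IN ('default', 'override')  -- only values changed from the built-in defaults
    ORDER BY name;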

What OS/version? - Ubuntu 20.04LTS

What environment/hardware? - a 48-core Amazon EC2 server with 396GB of RAM and local NVMe storage
(i3en.12xlarge)

What are the queries that are running BIND ? - Nothing special; during the problem period a lot of completely different queries get stuck in the BIND and PARSE stages, but no long-duration (>100ms) EXECUTE calls are found. In general it feels as if the whole BIND/PARSE machinery locks up for a short period.
==== LOG SAMPLE ==========================
2023-01-01 09:07:31.622 UTC 1848286 ****** from [local] [vxid:84/20886619 txid:0] [PARSE] LOG:  duration: 235.472 ms  parse <unnamed>: SELECT COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >= '2022-12-31 09:07:31.350000') AND (job_stats_master.created_at < '2023-01-01 09:07:31.350000') AND "job_stats_master"."employer_id" = **** AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" = **** AND "job_stats_master"."ip_matching_id" = *****
2023-01-01 09:07:31.622 UTC 1898699 ******  from [local] [vxid:158/22054921 txid:0] [BIND] LOG:  duration: 231.274 ms  bind <unnamed>: SELECT id, name FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 ******* from [local] [vxid:158/22054921 txid:0] [BIND] DETAIL:  parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 ******* from [local] [vxid:281/10515416 txid:0] [BIND] LOG:  duration: 231.024 ms  bind <unnamed>: SELECT id, name FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 ******* from [local] [vxid:281/10515416 txid:0] [BIND] DETAIL:  parameters: $1 = '0'

... 5 pages of BIND/PARSE entries for different, unrelated queries logged with over 100ms runtime

2023-01-01 09:07:31.623 UTC 1806315 ******* from [local] [vxid:231/17406673 txid:0] [BIND] LOG:  duration: 139.372 ms  bind <unnamed>: SELECT     employers.*,     third_party_employer_pixels.facebook_pixel_id AS facebook_pixel_id,     third_party_employer_pixels.google_pixel_id   AS google_pixel_id,     third_party_employer_pixels.google_actions    AS google_actions,     employer_pixel_configurations.solution        AS tracking_solution,     employer_pixel_configurations.domain_name     AS domain,     settings.use_multiple_bids   FROM employers     LEFT JOIN third_party_employer_pixels   ON third_party_employer_pixels.employer_id = employers.id     LEFT JOIN employer_pixel_configurations ON employer_pixel_configurations.id = employers.id     LEFT JOIN settings                      ON settings.id = employers.setting_id WHERE employers.id = $1
2023-01-01 09:07:31.623 UTC 1806315 ******* from [local] [vxid:231/17406673 txid:0] [BIND] DETAIL:  parameters: $1 = '*****'
2023-01-01 09:07:31.624 UTC 1806321 ******* from [local] [vxid:176/21846997 txid:0] [BIND] LOG:  duration: 120.237 ms  bind <unnamed>: SELECT job_boards.*, enterprises.product_type, feed_settings.use_employer_exported_name as use_employer_exported_name, integration_job_board_settings.integration_status as integration_status FROM job_boards LEFT JOIN integration_job_board_settings ON integration_job_board_settings.id = job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 ******* from [local] [vxid:176/21846997 txid:0] [BIND] DETAIL:  parameters: $1 = '****'
===============================================================
What is really curious in the log: every one of the 100+ queries stuck in the PARSE/BIND stage was logged (and thus unstuck) at the same exact moment... which very likely means they had all been stuck in the same single place.
E.g. something locked the whole PARSE/BIND machinery (but not EXECUTE) for 200+ms.
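
If whatever blocks them is a lightweight lock rather than a heavyweight lock, log_lock_waits would not report it, so next time I'd try to catch the wait events directly with a query along these lines (a sketch only; pg_stat_activity is standard, the filters are just illustrative):

    SELECT pid,
           state,
           wait_event_type,
           wait_event,
           now() - query_start AS query_age,
           left(query, 60)     AS query_head
    FROM pg_stat_activity
    WHERE state = 'active'
      AND wait_event_type IS NOT NULL
    ORDER BY query_age DESC;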

Are the slow BINDs failing?
No, they all executed successfully later after being unstuck.

Are their parameters being logged ?
Yes.

What else is running besides postgres ?
Nothing else; it's a dedicated DB server.

Are the DB clients local or remote ?
All remote, over a fast network.

It shouldn't matter, but what client library?
50% RoR (Ruby on Rails) / 50% Java (JDBC).


The problem is that the issue happens only a few times per 24 hours, and its usual duration is under 1 second,
so it is very hard to catch with perf, gdb or strace.
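
One workaround might be to sample pg_stat_activity at a high frequency and keep the output around, roughly like this (a sketch only; the database name and output path are placeholders):

    #!/bin/bash
    # Poll wait events every ~100ms and append them to a file, so the next
    # sub-second BIND/PARSE stall leaves a trace that can be inspected later.
    while true; do
        psql -XAtq -d mydb -c "
            SELECT now(), pid, state, wait_event_type, wait_event, left(query, 60)
            FROM pg_stat_activity
            WHERE state = 'active' AND wait_event_type IS NOT NULL;
        " >> /tmp/wait_event_samples.log
        sleep 0.1
    done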



--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

