On Sat, Jan 9, 2021 at 2:07 PM Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
How are you monitoring the COMMIT times? What do you generally see in pg_stat_activity.wait_event during the spikes/stalls?
Right now we just observe the COMMIT duration posted in the postgresql log (we log anything over 100ms).
One other thing that I shamefully forgot to mention. When we see these slow COMMITs in the log, they coincide with a connection storm (Cat 5 hurricane) from our apps where connections will go from ~200 to ~1200. This will probably disgust many, but our PG server's max_connections is set to 2000. We have a set of pgbouncers in front of this with a total max_db_connections of 1600. I know many of you think this defeats the whole purpose of having pgbouncer and I agree. I've been trying to explain as much and that even with 32 CPUs on this DB host, we probably shouldn't expect to be able to support more than 100-200 active connections, let alone 1600. I'm still pushing to have our app server instances (which also use their own JDBC (Hikari) connection pool and *then* go through pgbouncer) to lower their min/max connection settings but obviously it's sort of counterintuitive at first glance but hopefully everyone sees the bigger picture.
One nagging question I have is if the slow COMMIT is triggering the connection storm (eg app sees slow response or timeout from a current connection and fires off a new connection in its place), or vice-versa. We're planning to deploy new performant cloud storage (Azure Ultra disk) just for WAL logs but I'm hesitant to say it'll be a silver bullet when we still have this insane connection management strategy in place.
Curious to know what others think (please pull no punches) and if others have been in a similar scenario with anecdotes to share.
Thanks,
Don.
Don Seiler
www.seiler.us
www.seiler.us