Good morning,
This week we've noticed that we're starting to see spikes where COMMITs are taking much longer than usual. Sometimes, quite a few seconds to finish. After a few minutes they disappear but then return seemingly at random. This becomes visible to the app and end user as a big stall in activity.
The checkpoints are still running for their full 5 min checkpoint_timeout duration (logs all say "checkpoint starting: time" and I'm not seeing any warnings about them occurring too frequently.
This is PostgreSQL 12.4 on Ubuntu 18.04, all running in MS Azure (*not* managed by them).
# select version();
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.4 (Ubuntu 12.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.4 (Ubuntu 12.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
I have the stats_temp_directory in a tmpfs mount. I do have pg_wal on the same premium SSD storage volume as the data directory. Normally I would know to separate these but I was told with the cloud storage that it's all virtualized anyway, plus storage IOPS are determined by disk size so having a smaller volume just for pg_wal would hurt me in this case. The kind folks in the PG community Slack suggested just having one large premium cloud storage mount for the data directory and leave pg_wal inside because this virtualization removes any guarantee of true separation.
I'm wondering if others have experience running self-managed PG in a cloud setting (especially if in MS Azure) and what they might have seen/done in cases like this.
Thanks,