Hello,
I'm trying to figure out why we had a buildup of connections on our streaming replica. We're running Postgres 9.3.5 on the master and 9.3.10 on the replica, Linux 3.2.0 on both, with disks in RAID 10. Here are some graphs to illustrate what happened:
There were 3 different incidents, as shown by the spikes in the server connections graph. During each incident, the replica was still serving some requests, but much more slowly, and connections kept building up. pg_cancel_backend(), pg_terminate_backend(), and the server timeout of 5s weren't closing them. Eventually we had to restart the database to get back to normal, which isn't ideal.
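Next time it happens, a query like this against pg_stat_activity should show whether the stuck backends are waiting on a lock (this is a sketch using the 9.3 catalog, which still has the boolean `waiting` column rather than the newer wait_event columns):

```sql
-- Backends active or waiting for more than 5 seconds, longest first
SELECT pid, state, waiting, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY runtime DESC;
```

If `waiting` is true for the piled-up backends, pg_locks should say what they're blocked on.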
The second graph shows the rows updated and HOT-updated on the master (n_tup_upd and n_tup_hot_upd from pg_stat_user_tables) for a large, frequently updated table (27M rows, 20GB). It looks like HOT updates start dropping right before connections run away. Going through the code, I found this comment: https://github.com/postgres/postgres/blob/REL9_3_5/src/backend/access/heap/heapam.c#L3031-L3041
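For reference, the graph is derived from counters like these (the table name here is a stand-in, not the real one):

```sql
-- HOT update ratio for the table in question
SELECT n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'big_hstore_table';  -- hypothetical name
```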
From my understanding, that means a HOT update takes a weaker lock, but if a column is being "detoasted" the update will take the stronger lock instead. I'm not sure exactly what detoasting involves, but I do know that table has a TOAST table, since it uses an hstore column. We can store pretty large values in the hstore column, so I wouldn't be surprised if that was causing the issue.
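To see how much of the table's 20GB actually lives out-of-line in TOAST, something like this should work (again, the table name is a placeholder):

```sql
-- Main heap size vs. its TOAST table size
SELECT pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'big_hstore_table';  -- hypothetical name
```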
The third graph shows TOAST block cache hits (toast_blks_hit from pg_statio_user_tables); the hits on the replica (purple) took a nosedive after the first incident and never really recovered. Does that mean the TOAST table is too large to fit in cache?
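A hit ratio rather than raw hits might make that clearer; roughly (same placeholder table name):

```sql
-- TOAST buffer cache hit ratio, run on the replica
SELECT toast_blks_hit,
       toast_blks_read,
       round(100.0 * toast_blks_hit
             / NULLIF(toast_blks_hit + toast_blks_read, 0), 1) AS toast_hit_pct
FROM pg_statio_user_tables
WHERE relname = 'big_hstore_table';  -- hypothetical name
```

Note this only covers shared_buffers, not the OS page cache, so a low number doesn't necessarily mean the reads went to disk.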
The fourth graph shows the rate of dead row creation on the master, with the large blue area being the table in question. I'm assuming the dead rows are from updates to the hstore column, but I'm not certain.
The fifth graph shows disk write throughput on the replica, which I'm assuming comes from applying the WAL updates for that large table. Write throughput was higher than normal but not maxing out the disk, and write latency was minimal.
The sixth graph is the replica lag, which only fell behind during the first incident, not the others.
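The lag series is sampled on the replica with something along these lines (a sketch; replay lag in time units, which can look like zero if the master is idle):

```sql
-- Approximate replay lag on a 9.3 standby
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
```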
My question is: why was the Postgres replica unable to close connections during these incidents? And why were reads on the replica so slow at the time? Are there locks that lock out readers when there's a lot of update churn?
Carlo Cabanilla
Datadog