On Wed, Aug 31, 2016 at 8:07 AM, Chinh Nguyen <chinhngt@xxxxxxxxx> wrote:
Hello All,
We recently tried to reindex on a heavy used database cluster (master
+ multiple hot-standby async replicas, all taking traffic) and
replicas were effectively blocked for 10 minutes resulting in drop of
a lot of read traffic. We reindex by create new index concurrently,
then rename old and new index and drop old index afterwards.
After we execute the renaming on master, we found a quick burst of
exclusive locks as expected:
Process 34482 waits for AccessShareLock on relation 4153657 of
database 16420; blocked by process 18953.
Process 18953: ALTER INDEX public.old_index RENAME TO temp_index;
But all hot-standby replicas started to see queries locking up for 10
minutes, sample log from a replica is below:
LOG: process 41040 still waiting for AccessShareLock on relation
4153657 of database 16420 after 1000.072 ms
We have in config:
max_standby_streaming_delay = 10min
lock_timeout = 10s
Any idea what happened with the replicas? Why queries on replica
locked for so long vs sub-second on master?
I think, it is not the lock release wait time, it is the delay in WAL replay on standby from master due to the parameter "max_standby_streaming_delay".
If you have read queries running on standby, WAL replay waits for "max_standby_streaming_delay" duration (which is 10 mins in your case) before timing out queries on replica.
Regards,
Venkata B N
Fujitsu Australia