On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
> The reindex went fine in the primary database and in one of our
> standby. The other standby that we also operate for some reason
> ended up in a state where all transactions were locked by the WAL
> process and the WAL process was not able to make any progress. In
> order to solve this issue we had to move traffic from the “bad”
> standby to the healthy one and then kill all transactions that were
> running in the “bad” standby. After that, replication was able to
> resume successfully.

You are referring to the startup process that replays WAL, right?
Without having an idea about the type of workload your primary and/or
standbys are facing, as well as an idea of the configuration you are
using on both (hot_standby_feedback for one), I have no direct idea,
but that could be a conflict caused by a concurrent vacuum.

Seeing where things got stuck could also be useful, perhaps with a
backtrace of the area where it happens and some information around
it.

> I’m just trying to understand what could have caused this issue. I
> was not able to identify any queries in the standby that would be
> locking the WAL process. Any insight would be more than welcome!

That's not going to be easy without more information, I am afraid.
--
Michael