Micheal,
Thanks for much for the quick response.
On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote: The reindex went fine in the primary database and in one of our standby. The other standby that we also operate for some reason ended up in a state where all transactions were locked by the WAL process and the WAL process was not able to make any progress. In order to solve this issue we had to move traffic from the “bad” standby to the healthy one and then kill all transactions that were running in the “bad” standby. After that, replication was able to resume successfully.
You are referring to the startup process that replays WAL, right? That is correct, I’m talking about the startup process that replays the WAL files.
Without having an idea about the type of workload your primary and/or standbys are facing, as well as an idea of the configuration you are using on both (hot_standby_feedback for one), I have no direct idea,
Primary handle IOT data ingestion. The table that we had to REINDEX gets updated every time a new message arrives in the system so updated are happening very often on that table, thus, the index/table bloat. The standby at any point in time would be receiving queries that would take advantage of the indexes that were being re indexed. hot_standby_feedback is currently turned OFF on the standbys. but that could be a conflict caused by a concurrent vacuum.
Seeing where things got stuck could also be useful, perhaps with a backtrace of the area where it happens and some information around it. I’m just trying to understand what could have caused this issue. I was not able to identify any queries in the standby that would be locking the WAL process. Any insight would be more than welcome!
That's not going to be easy without more information, I am afraid. -- Michael
|