On Sat, Jan 2, 2021 at 12:38 PM Li EF Zhang <bjzhangl@xxxxxxxxxx> wrote: > > We have a postgresql database cluster with 3 node, one is primary, the other 2 are secondary. The cluster is managed by patroni. The 3 databases are in respective containers. When we update one secondary container image, the database in this container can not start. It reports "database system is starting up". We tried to delete the container and recover it from primary, the same error is reported. > > I checked the db log, there is a message: > DEBUG: recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 4739126 (now 1422751)#0122020-12-15 07:40:08.513 UTC [146-5182] CONTEXT: WAL redo at 3/ED11B020 for Standby/RUNNING_XACTS: nextXid 4739482 latestCompletedXid 4739475 oldestRunningXid 1422751; 16 xacts: 2716862 2721890 4665244 2495592 2289138 2288820 2287653 1422751 4280517 2288510 2287620 3297674 1757103 4739326 3320989 2259670; subxid ovf > > > Seems snapshot overflowed which causes the secondary can not start up. I am newer to postgresql. I do not know very clearly how this happened and how to fix it. Thanks! AFAICT it looks like you have a very old running transaction on the system (transaction id 1422751 is 3 million transactions ago, and there are other transactions also very old). As mentioned on https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-ADMIN, having very long running transactions in combinations with many subtransactions can cause it to take a long time for the system to reach a consistent state. See also the caveat list at the bottom of that page. If those transactions are such that they're needed, you have no other choice but to wait I believe. But in (the more likely, I'd say) they represent hung or otherwise broken clients, then terminating those sessions on the primary should help the process along. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/