> What additional information is needed?

Usually the server logs and the output of pg_rewind from the time of
the trouble are needed as the first step.

> Next, pg_rewind returns errors while reading the log from the backup
> back, looking for the last checkpoint, which is quite reasonable
> because, once a new leader starts, the point of divergence normally
> ends up in the next timeline and the previous timeline's backup log
> does not have a block with the LSN of the divergence.

That makes pg_rewind sound like crap, but that is not how it works.

pg_rewind reads the timeline history files from both servers to find
the last timeline up to which the two servers share the same history,
then takes as the divergence point the latest LSN up to which the two
servers are known to share that history. It then overwrites the pages
that were modified on the former leader in the *previous* timeline,
from the last common checkpoint up to the former leader's final
(shutdown) checkpoint, with the contents of the same pages *on the new
leader*. It needs no page data from the older timeline.

If nothing is going wrong, pg_rewind is not expected to run into
errors like these:

> could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50
> could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB,

These mean that the WAL files are somehow broken.

> When pg_rewind is run, it also uses the log from the backup (the
> lagging log from the new leader) instead of the partial log with
> which the former leader has already been started.

I don't see why the former leader would not have access to the partial
log (or the latest WAL file, I suppose). That file is essential for
pg_rewind to work, since it exists nowhere else, and it must be in the
pg_wal directory unless someone removed it.

Thus, I think we need the exact steps that you and your system took on
the PostgreSQL side after the failover happened.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
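
P.S. To illustrate the point about timeline histories, here is a rough
Python sketch of how a divergence point can be derived by comparing the
two servers' histories. It is only a simplified model of the idea
described above, not pg_rewind's actual code. The function names
(parse_lsn, build_history, find_divergence) and the LSN values in the
usage example are hypothetical. It assumes the usual history file
layout of one line per ancestor timeline: parent timeline ID, switch
point LSN, reason.

```python
from typing import List, Optional, Tuple

# (timeline ID, begin LSN, end LSN); end is None for the still-open timeline
Entry = Tuple[int, int, Optional[int]]


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    """Format a 64-bit LSN back into the 'hi/lo' hex notation."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def build_history(history_text: str, current_tli: int) -> List[Entry]:
    """Build the timeline list from history file text plus the current timeline."""
    entries: List[Entry] = []
    begin = 0
    for line in history_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        tli, switch = int(fields[0]), parse_lsn(fields[1])
        entries.append((tli, begin, switch))
        begin = switch  # the next timeline starts at this switch point
    entries.append((current_tli, begin, None))  # current timeline has no end yet
    return entries


def find_divergence(source: List[Entry], target: List[Entry]) -> Tuple[int, int]:
    """Return (last common timeline ID, LSN where the histories diverge)."""
    common = -1
    for i, (s, t) in enumerate(zip(source, target)):
        if s[0] != t[0] or s[1] != t[1]:
            break  # different timeline ID or start point: history has forked
        common = i
    if common < 0:
        raise ValueError("no common ancestor timeline")
    # The shared history ends where the first of the two servers left it.
    ends = [e for e in (source[common][2], target[common][2]) if e is not None]
    if not ends:
        raise ValueError("both servers are on the same open timeline")
    return source[common][0], min(ends)


if __name__ == "__main__":
    # Hypothetical case: the new leader was promoted to timeline 2 at
    # 0/3000060 while the former leader stayed on timeline 1.
    new_leader = build_history("1\t0/3000060\tno recovery target specified\n", 2)
    former_leader = build_history("", 1)
    tli, lsn = find_divergence(new_leader, former_leader)
    print(tli, format_lsn(lsn))  # prints "1 0/3000060"
```

Note that only the history files and the former leader's own WAL are
involved here; nothing from an older timeline's backup is consulted.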