postgres: WAL ends befor end of online backup

Moorthy RS <rsmoorthy@xxxxxxxxx> · Thu, 12 Nov 2020 18:17:53 +0530

We are running postgres 9.6, with 10+TB size. Backups have been taken using a homegrown tool "pgrsync", which uses S3 as the repository. Both the backup files and WAL archives are stored on S3.
Problem: While trying to restore, the WAL archives restore randomly fails for some backups with the following message in logs:
2020-11-12 06:33:32 UTC [10037]: [27988-1] user=,db=LOG:  redo done at 5493D/2EFFF568                                                                                                        
2020-11-12 06:33:32 UTC [10037]: [27989-1] user=,db=LOG:  last completed transaction was at log time 2020-11-06 12:31:27.796805+00                                                           
2020-11-12 06:33:34 UTC [10037]: [27990-1] user=,db=LOG:  restored log file "000000020005493D0000002E" from archive                                                                          
2020-11-12 06:33:34 UTC [10037]: [27991-1] user=,db=FATAL:  WAL ends before end of online backup                                                                                             
2020-11-12 06:33:34 UTC [10037]: [27992-1] user=,db=HINT:  All WAL generated while online backup was taken must be available at recovery.                                                    
2020-11-12 06:33:36 UTC [10033]: [3-1] user=,db=LOG:  startup process (PID 10037) exited with exit code 1                                                                                    
2020-11-12 06:33:36 UTC [10033]: [4-1] user=,db=LOG:  terminating any other active server processes                                                                                          
2020-11-12 06:33:48 UTC [10033]: [5-1] user=,db=LOG:  database system is shut down                     
In this case, the backup start location is 00000002000544C60000006B and stop location is 00000002000545210000008D, (based on pg_stop_backup() output) but it stops in between at 005493D and terminates the restore. If I redo the restore again, it stops exactly at the same point. Similar results from couple of more backups, while other backups successfully restores.
It is possibly an indication of some specific WAL files got corrupted during the backup/restore process. Is that the correct interpretation?
Questions:
Are there any ways of identifying if the WAL file is corrupted?
Is there a way to move ahead without losing data? (I am wary of using pg_resetxlog)