We are running postgres 9.6, with 10+TB size. Backups have been taken using a homegrown tool "pgrsync", which uses S3 as the repository. Both the backup files and WAL archives are stored on S3.
Problem: While trying to restore, the WAL archives restore randomly fails for some backups with the following message in logs:
2020-11-12 06:33:32 UTC [10037]: [27988-1] user=,db=LOG: redo done at 5493D/2EFFF568
2020-11-12 06:33:32 UTC [10037]: [27989-1] user=,db=LOG: last completed transaction was at log time 2020-11-06 12:31:27.796805+00
2020-11-12 06:33:34 UTC [10037]: [27990-1] user=,db=LOG: restored log file "000000020005493D0000002E" from archive
2020-11-12 06:33:34 UTC [10037]: [27991-1] user=,db=FATAL: WAL ends before end of online backup
2020-11-12 06:33:34 UTC [10037]: [27992-1] user=,db=HINT: All WAL generated while online backup was taken must be available at recovery.
2020-11-12 06:33:36 UTC [10033]: [3-1] user=,db=LOG: startup process (PID 10037) exited with exit code 1
2020-11-12 06:33:36 UTC [10033]: [4-1] user=,db=LOG: terminating any other active server processes
2020-11-12 06:33:48 UTC [10033]: [5-1] user=,db=LOG: database system is shut down
In this case, the backup start location is 00000002000544C60000006B
and stop location is 00000002000545210000008D
, (based on pg_stop_backup() output) but it stops in between at 005493D and terminates the restore. If I redo the restore again, it stops exactly at the same point. Similar results from couple of more backups, while other backups successfully restores.
It is possibly an indication of some specific WAL files got corrupted during the backup/restore process. Is that the correct interpretation?
Questions:
- Are there any ways of identifying if the WAL file is corrupted?
- Is there a way to move ahead without losing data? (I am wary of using pg_resetxlog)