May data be corrupted after an interrupted, but afterwards sucessfully replayed recovery?

"Dietrich, Benjamin" <b.dietrich@xxxxxxxxxxxxxxxx> · Thu, 20 Feb 2025 15:00:35 +0000

Hi experts,

we have a cluster that crashed because of a full 'pg_wal' disk.
When it automatically tried to restart after failure, it went into recovery,
crashed again - this time in recovery - and stopped (see logs below [1]).

The original problem is solved and the cluster now successfully started an
finished recovery.
But when I restarted the cluster it HINTed:

2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
while in recovery at 2025-02-20 03:38:24 CET
2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
data is corrupted and you will have to use the last backup for recovery.

My question is now, if there is still a chance that the data is corrupted,
although the interrupted recovery was successfully redone later?
Should I recover the whole cluster from the last PITR backup to be safe? Or
can I be sure that the data is in a solid state, since recovery succeeded on
second try?

Regards,
Benjamin 

----------------------
[1] logs:

2025-02-20 03:38:18.278 CET [3767151] PANIC:  could not write to file
"pg_wal/xlogtemp.3767151": No space left on device
2025-02-20 03:38:18.315 CET [1909] LOG:  server process (PID 3767151) was
terminated by signal 6: Aborted
2025-02-20 03:38:18.315 CET [1909] LOG:  terminating any other active server
processes
2025-02-20 03:38:21.938 CET [1909] LOG:  all server processes terminated;
reinitializing
2025-02-20 03:38:23.867 CET [3863518] LOG:  database system was interrupted;
last known up at 2025-02-20 03:34:49 CET
2025-02-20 03:38:24.089 CET [3863518] LOG:  database system was not properly
shut down; automatic recovery in progress
2025-02-20 03:38:24.102 CET [3863518] LOG:  redo starts at E8A/EB648078
2025-02-20 03:38:24.119 CET [3863518] LOG:  redo done at E8A/EC19B6B0 system
usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s
2025-02-20 03:38:24.136 CET [3863518] FATAL:  could not write to file
"pg_wal/xlogtemp.3863518": No space left on device
2025-02-20 03:38:24.139 CET [1909] LOG:  startup process (PID 3863518)
exited with exit code 1
2025-02-20 03:38:24.139 CET [1909] LOG:  terminating any other active server
processes
2025-02-20 03:38:24.140 CET [1909] LOG:  shutting down due to startup
process failure
2025-02-20 03:38:24.213 CET [1909] LOG:  database system is shut down
2025-02-20 09:26:57.597 CET [3982483] LOG:  starting PostgreSQL 15.10
(Debian 15.10-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian
12.2.0-14) 12.2.0, 64-bit
2025-02-20 09:26:57.597 CET [3982483] LOG:  listening on IPv4 address
"0.0.0.0", port 5432
2025-02-20 09:26:57.597 CET [3982483] LOG:  listening on IPv6 address "::",
port 5432
2025-02-20 09:26:57.598 CET [3982483] LOG:  listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
while in recovery at 2025-02-20 03:38:24 CET
2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
data is corrupted and you will have to use the last backup for recovery.
2025-02-20 09:26:57.886 CET [3982486] LOG:  database system was not properly
shut down; automatic recovery in progress
2025-02-20 09:26:57.890 CET [3982486] LOG:  redo starts at E8A/EB648078
2025-02-20 09:26:57.932 CET [3982486] LOG:  redo done at E8A/EC19B6B0 system
usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
2025-02-20 09:26:57.966 CET [3982484] LOG:  checkpoint starting:
end-of-recovery immediate wait
2025-02-20 09:26:58.061 CET [3982484] LOG:  checkpoint complete: wrote 390
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s,
sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001
s; distance=26335 kB, estimate=26335 kB
2025-02-20 09:26:58.068 CET [3982483] LOG:  database system is ready to
accept connections
Attachment:
smime.p7s

Description: S/MIME cryptographic signature