Hi experts,

we have a cluster that crashed because of a full 'pg_wal' disk. When it automatically tried to restart after the failure, it went into recovery, crashed again - this time in recovery - and stopped (see logs below [1]).

The original problem is solved, and the cluster has now successfully started and finished recovery. But when I restarted the cluster, it logged this HINT:

2025-02-20 09:26:57.615 CET [3982486] LOG: database system was interrupted while in recovery at 2025-02-20 03:38:24 CET
2025-02-20 09:26:57.615 CET [3982486] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.

My question now is whether there is still a chance that the data is corrupted, even though the interrupted recovery was successfully redone later. Should I recover the whole cluster from the last PITR backup to be safe? Or can I be sure that the data is in a solid state, since recovery succeeded on the second try?

Regards,
Benjamin

----------------------
[1] logs:

2025-02-20 03:38:18.278 CET [3767151] PANIC: could not write to file "pg_wal/xlogtemp.3767151": No space left on device
2025-02-20 03:38:18.315 CET [1909] LOG: server process (PID 3767151) was terminated by signal 6: Aborted
2025-02-20 03:38:18.315 CET [1909] LOG: terminating any other active server processes
2025-02-20 03:38:21.938 CET [1909] LOG: all server processes terminated; reinitializing
2025-02-20 03:38:23.867 CET [3863518] LOG: database system was interrupted; last known up at 2025-02-20 03:34:49 CET
2025-02-20 03:38:24.089 CET [3863518] LOG: database system was not properly shut down; automatic recovery in progress
2025-02-20 03:38:24.102 CET [3863518] LOG: redo starts at E8A/EB648078
2025-02-20 03:38:24.119 CET [3863518] LOG: redo done at E8A/EC19B6B0 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s
2025-02-20 03:38:24.136 CET [3863518] FATAL: could not write to file "pg_wal/xlogtemp.3863518": No space left on device
2025-02-20 03:38:24.139 CET [1909] LOG: startup process (PID 3863518) exited with exit code 1
2025-02-20 03:38:24.139 CET [1909] LOG: terminating any other active server processes
2025-02-20 03:38:24.140 CET [1909] LOG: shutting down due to startup process failure
2025-02-20 03:38:24.213 CET [1909] LOG: database system is shut down
2025-02-20 09:26:57.597 CET [3982483] LOG: starting PostgreSQL 15.10 (Debian 15.10-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2025-02-20 09:26:57.597 CET [3982483] LOG: listening on IPv4 address "0.0.0.0", port 5432
2025-02-20 09:26:57.597 CET [3982483] LOG: listening on IPv6 address "::", port 5432
2025-02-20 09:26:57.598 CET [3982483] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2025-02-20 09:26:57.615 CET [3982486] LOG: database system was interrupted while in recovery at 2025-02-20 03:38:24 CET
2025-02-20 09:26:57.615 CET [3982486] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2025-02-20 09:26:57.886 CET [3982486] LOG: database system was not properly shut down; automatic recovery in progress
2025-02-20 09:26:57.890 CET [3982486] LOG: redo starts at E8A/EB648078
2025-02-20 09:26:57.932 CET [3982486] LOG: redo done at E8A/EC19B6B0 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
2025-02-20 09:26:57.966 CET [3982484] LOG: checkpoint starting: end-of-recovery immediate wait
2025-02-20 09:26:58.061 CET [3982484] LOG: checkpoint complete: wrote 390 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s, sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001 s; distance=26335 kB, estimate=26335 kB
2025-02-20 09:26:58.068 CET [3982483] LOG: database system is ready to accept connections
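
To make the two recovery attempts in the logs above easier to compare, here is a minimal Python sketch that pulls the redo start/end LSNs out of the log text and notes whether each attempt went on to "ready to accept connections". The names `summarize_recoveries` and `SAMPLE` are made up for illustration, and the matching relies only on the message wording visible in these logs. It only summarizes what the log says; it is not a data-integrity check.

```python
import re

# Matches the LSN in "redo starts at X/Y" and "redo done at X/Y" log messages.
REDO_RE = re.compile(r"redo (starts|done) at ([0-9A-F]+/[0-9A-F]+)")


def summarize_recoveries(log_text):
    """Return one dict per recovery attempt found in the log text.

    Assumes one log entry per line, with the message wording seen above.
    """
    attempts = []
    current = None
    for line in log_text.splitlines():
        m = REDO_RE.search(line)
        if m and m.group(1) == "starts":
            # A new recovery attempt begins with "redo starts at ...".
            current = {"start": m.group(2), "done": None, "completed": False}
            attempts.append(current)
        elif m and current is not None:
            # "redo done at ..." for the attempt currently being tracked.
            current["done"] = m.group(2)
        elif current is not None and "ready to accept connections" in line:
            # The server came up after this attempt.
            current["completed"] = True
            current = None
        elif current is not None and "startup process failure" in line:
            # The attempt was abandoned; "completed" stays False.
            current = None
    return attempts


# Excerpt of the relevant lines from the logs above.
SAMPLE = """\
2025-02-20 03:38:24.102 CET [3863518] LOG: redo starts at E8A/EB648078
2025-02-20 03:38:24.119 CET [3863518] LOG: redo done at E8A/EC19B6B0 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s
2025-02-20 03:38:24.136 CET [3863518] FATAL: could not write to file "pg_wal/xlogtemp.3863518": No space left on device
2025-02-20 03:38:24.140 CET [1909] LOG: shutting down due to startup process failure
2025-02-20 09:26:57.890 CET [3982486] LOG: redo starts at E8A/EB648078
2025-02-20 09:26:57.932 CET [3982486] LOG: redo done at E8A/EC19B6B0 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
2025-02-20 09:26:57.966 CET [3982484] LOG: checkpoint starting: end-of-recovery immediate wait
2025-02-20 09:26:58.068 CET [3982483] LOG: database system is ready to accept connections
"""

if __name__ == "__main__":
    for attempt in summarize_recoveries(SAMPLE):
        print(attempt)
```

On this excerpt, both attempts report the same redo range (E8A/EB648078 to E8A/EC19B6B0); only the second is followed by the end-of-recovery checkpoint and the "ready to accept connections" message.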