Re: May data be corrupted after an interrupted, but afterwards sucessfully replayed recovery?

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Fri, 21 Feb 2025 08:01:02 +0100

On Thu, 2025-02-20 at 15:00 +0000, Dietrich, Benjamin wrote:
> we have a cluster that crashed because of a full 'pg_wal' disk.
> When it automatically tried to restart after failure, it went into recovery,
> crashed again - this time in recovery - and stopped (see logs below [1]).
> 
> The original problem is solved and the cluster now successfully started an
> finished recovery.
> But when I restarted the cluster it HINTed:
> 
> 2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
> while in recovery at 2025-02-20 03:38:24 CET
> 2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
> data is corrupted and you will have to use the last backup for recovery.
> 
> 
> My question is now, if there is still a chance that the data is corrupted,
> although the interrupted recovery was successfully redone later?

No, if you didn't take other steps to break the database, like remove files
or run "pg_resetwal", everything should be in perfet order.

Perhaps that hint should be defused to say "This could mean that..."

> ----------------------
> [1] logs:
> 
> 2025-02-20 03:38:18.278 CET [3767151] PANIC:  could not write to file
> "pg_wal/xlogtemp.3767151": No space left on device
> [...]
> 2025-02-20 03:38:23.867 CET [3863518] LOG:  database system was interrupted;
> last known up at 2025-02-20 03:34:49 CET
> [...]
> 2025-02-20 03:38:24.136 CET [3863518] FATAL:  could not write to file
> "pg_wal/xlogtemp.3863518": No space left on device
> [...]
> 2025-02-20 09:26:57.615 CET [3982486] LOG:  database system was interrupted
> while in recovery at 2025-02-20 03:38:24 CET
> 2025-02-20 09:26:57.615 CET [3982486] HINT:  This probably means that some
> data is corrupted and you will have to use the last backup for recovery.
> 2025-02-20 09:26:57.886 CET [3982486] LOG:  database system was not properly
> shut down; automatic recovery in progress
> 2025-02-20 09:26:57.890 CET [3982486] LOG:  redo starts at E8A/EB648078
> 2025-02-20 09:26:57.932 CET [3982486] LOG:  redo done at E8A/EC19B6B0 system
> usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
> 2025-02-20 09:26:57.966 CET [3982484] LOG:  checkpoint starting:
> end-of-recovery immediate wait
> 2025-02-20 09:26:58.061 CET [3982484] LOG:  checkpoint complete: wrote 390
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.058 s,
> sync=0.001 s, total=0.096 s; sync files=103, longest=0.001 s, average=0.001
> s; distance=26335 kB, estimate=26335 kB
> 2025-02-20 09:26:58.068 CET [3982483] LOG:  database system is ready to
> accept connections

That confirms that everything worked the way it should, and your database
will be fine.

Yours,
Laurenz Albe

-- 

*E-Mail Disclaimer*
Der Inhalt dieser E-Mail ist ausschliesslich fuer den 
bezeichneten Adressaten bestimmt. Wenn Sie nicht der vorgesehene Adressat 
dieser E-Mail oder dessen Vertreter sein sollten, so beachten Sie bitte, 
dass jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder 
Weitergabe des Inhalts dieser E-Mail unzulaessig ist. Wir bitten Sie, sich 
in diesem Fall mit dem Absender der E-Mail in Verbindung zu setzen.

*CONFIDENTIALITY NOTICE & DISCLAIMER
*This message and any attachment are 
confidential and may be privileged or otherwise protected from disclosure 
and solely for the use of the person(s) or entity to whom it is intended. 
If you have received this message in error and are not the intended 
recipient, please notify the sender immediately and delete this message and 
any attachment from your system. If you are not the intended recipient, be 
advised that any use of this message is prohibited and may be unlawful, and 
you must not copy this message or attachment or disclose the contents to 
any other person.