Hi,

My database server (PostgreSQL 12.5) uses remote storage as its data directory.
I powered off the database server and started PostgreSQL (the same version) on another server with the same data, but it failed to start. I checked the WAL and found that the content around the latest checkpoint
is empty (about 1700 zero bytes):

# pg_controldata -D .
pg_control version number:            1201
Catalog version number:               201909212
Database system identifier:           6967594878781603854
Database cluster state:               in production
pg_control last modified:             Wed 09 Jun 2021 04:23:24 PM CST
Latest checkpoint location:           0/9893978
Latest checkpoint's REDO location:    0/9893940
Latest checkpoint's REDO WAL file:    000000010000000000000009
Latest checkpoint's TimeLineID:       1
...

# pg_waldump -p ./pg_wal/ -s 0/9893900
first record is after 0/9893900, at 0/9893908, skipping over 8 bytes
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/09893908, prev 0/098938E0, desc: RUNNING_XACTS nextXid 30176 latestCompletedXid 30175 oldestRunningXid 30176
pg_waldump: fatal: error in WAL record at 0/9893908: invalid record length at 0/9893940: wanted 24, got 0

# hexdump -C ./pg_wal/000000010000000000000009 -s 0x89393a -n 10000
0089393a  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00893ffa  00 00 00 00 00 00 01 d1  05 00 01 00 00 00 00 40  |...............@|
0089400a  89 07 00 00 00 00 4b 00  00 00 00 00 00 00 00 00  |......K.........|
...

It seems the checkpoint record in the WAL buffer was not flushed to disk, leading to the recovery failure. The remote storage supports POSIX I/O, meaning the data should reach disk as long as PostgreSQL calls fdatasync (or a similar system call). Please note that the storage server itself has not been shut down.
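To rule out the possibility that the remote storage acknowledges fdatasync without actually persisting the data, I plan to run pg_test_fsync against a file on the same mount. A minimal sketch, where /remote/mnt is only a placeholder for the real mount point:

# probe fsync/fdatasync behaviour of the remote mount
# (/remote/mnt is a placeholder for the actual remote storage mount point)
pg_test_fsync --filename=/remote/mnt/fsync_probe --secs-per-test=5

As far as I understand, per-second figures that look too good for the underlying hardware would suggest the flushes are only reaching a volatile cache.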
Besides, I traced the system calls of a PostgreSQL checkpoint (triggered by "psql -c checkpoint"):

# strace -fvy -s 64 pg_ctl -D /tmp/data start 2>&1 | grep -E 'open|sync'
...
[pid 109705] openat(AT_FDCWD, "pg_xact/0000", O_RDWR|O_CREAT, 0600 <unfinished ...>
[pid 109705] <... openat resumed> ) = 3</tmp/data/pg_xact/0000>
[pid 109705] fsync(3</tmp/data/pg_xact/0000> <unfinished ...>
[pid 109705] <... fsync resumed> ) = 0
[pid 109705] openat(AT_FDCWD, "pg_xact", O_RDONLY) = 3</tmp/data/pg_xact>
[pid 109705] fsync(3</tmp/data/pg_xact>) = 0
[pid 109705] openat(AT_FDCWD, "pg_commit_ts", O_RDONLY) = 3</tmp/data/pg_commit_ts>
[pid 109705] fsync(3</tmp/data/pg_commit_ts>) = 0
[pid 109705] openat(AT_FDCWD, "pg_subtrans/0000", O_RDWR|O_CREAT, 0600) = 3</tmp/data/pg_subtrans/0000>
[pid 109705] openat(AT_FDCWD, "pg_multixact/offsets/0000", O_RDWR|O_CREAT, 0600) = 3</tmp/data/pg_multixact/offsets/0000>
[pid 109705] fsync(3</tmp/data/pg_multixact/offsets/0000>) = 0
[pid 109705] openat(AT_FDCWD, "pg_multixact/offsets", O_RDONLY) = 3</tmp/data/pg_multixact/offsets>
[pid 109705] fsync(3</tmp/data/pg_multixact/offsets>) = 0
[pid 109705] openat(AT_FDCWD, "pg_multixact/members", O_RDONLY) = 3</tmp/data/pg_multixact/members>
[pid 109705] fsync(3</tmp/data/pg_multixact/members>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical/snapshots", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3</tmp/data/pg_logical/snapshots>
[pid 109705] openat(AT_FDCWD, "pg_logical/mappings", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3</tmp/data/pg_logical/mappings>
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint.tmp", O_WRONLY|O_CREAT|O_EXCL, 0600) = 3</tmp/data/pg_logical/replorigin_checkpoint.tmp>
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint.tmp", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint.tmp>
[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint.tmp>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint>
[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint>
[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical", O_RDONLY) = 3</tmp/data/pg_logical>
[pid 109705] fsync(3</tmp/data/pg_logical>) = 0
[pid 109705] openat(AT_FDCWD, "pg_wal/00000001000000000000000B", O_RDWR <unfinished ...>
[pid 109705] <... openat resumed> ) = 3</tmp/data/pg_wal/00000001000000000000000B>
[pid 109705] fdatasync(3</tmp/data/pg_wal/00000001000000000000000B> <unfinished ...>    (let's say this is A)
[pid 109705] <... fdatasync resumed> ) = 0
[pid 109705] openat(AT_FDCWD, "/tmp/data/global/pg_control", O_RDWR <unfinished ...>
[pid 109705] <... openat resumed> ) = 4</tmp/data/global/pg_control>
[pid 109705] fsync(4</tmp/data/global/pg_control> <unfinished ...>    (let's say this is B)
[pid 109705] <... fsync resumed> ) = 0
[pid 109705] openat(AT_FDCWD, "pg_wal", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4</tmp/data/pg_wal>
[pid 109705] openat(AT_FDCWD, "pg_subtrans", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4</tmp/data/pg_subtrans>
[pid 109707] fdatasync(3</tmp/data/pg_wal/00000001000000000000000B>) = 0    (let's say this is C)
...

From the output, I can see that the data files, the pg_control file and the WAL are flushed to disk. I also noticed that the pg_control file is flushed between two WAL flushes (the calls marked A, B and C above).
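For what it's worth, next time I will try to isolate this ordering by tracing only the sync-related calls around a manual checkpoint, with timestamps. A rough sketch, where <checkpointer_pid> is a placeholder for the pid of the "checkpointer" process reported by ps:

# trace only open/fsync/fdatasync of the running checkpointer, with timestamps
strace -f -tt -e trace=openat,fsync,fdatasync -p <checkpointer_pid> -o /tmp/ckpt.trace &
psql -c checkpoint
sleep 2      # give the checkpointer time to finish flushing
kill %1      # detach strace so the trace file is complete
grep -E 'fsync|fdatasync' /tmp/ckpt.trace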
Here are my questions:

1. If a power failure occurred between B and C, would it be possible that the checkpoint was written to pg_control but not to the WAL, so that PostgreSQL recovery would then fail?
2. Is there any other reason that could cause this problem?
3. Is there any way to recover the data? I know pg_resetwal can reset the WAL so that PostgreSQL will start, but that would undoubtedly cause some data loss. (The exact commands I have in mind are sketched in the P.S. below.)

Thanks,
Xie Bin
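P.S. For question 3, in case it helps, this is the last-resort sequence I have in mind (the paths are placeholders, and I would only run the final command once everything else has been ruled out):

# take a cold file-level copy of the whole cluster before touching anything (placeholder paths)
cp -a /path/to/data /path/to/data.backup

# dry run: print the values pg_resetwal would write, without modifying anything
pg_resetwal -n -D /path/to/data

# last resort: force the reset, accepting the loss of any WAL that was never replayed
pg_resetwal -f -D /path/to/data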