Hi,

My database server (PostgreSQL 12.5) uses remote storage as its data directory.
I powered off the database server and started PostgreSQL (the same version) on another server with the same data, but it failed to start. I checked the WAL and found that the content around the latest checkpoint
is empty (about 1700 zero bytes):

# pg_controldata -D .
pg_control version number:            1201
Catalog version number:               201909212
Database system identifier:           6967594878781603854
Database cluster state:               in production
pg_control last modified:             Wed 09 Jun 2021 04:23:24 PM CST
Latest checkpoint location:           0/9893978
Latest checkpoint's REDO location:    0/9893940
Latest checkpoint's REDO WAL file:    000000010000000000000009
Latest checkpoint's TimeLineID:       1
...

# pg_waldump -p ./pg_wal/ -s 0/9893900
first record is after 0/9893900, at 0/9893908, skipping over 8 bytes
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/09893908, prev 0/098938E0, desc: RUNNING_XACTS nextXid 30176 latestCompletedXid 30175 oldestRunningXid 30176
pg_waldump: fatal: error in WAL record at 0/9893908: invalid record length at 0/9893940: wanted 24, got 0

# hexdump -C ./pg_wal/000000010000000000000009 -s 0x89393a -n 10000
0089393a  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00893ffa  00 00 00 00 00 00 01 d1  05 00 01 00 00 00 00 40  |...............@|
0089400a  89 07 00 00 00 00 4b 00  00 00 00 00 00 00 00 00  |......K.........|
...

It seems the checkpoint record in the WAL buffer was not flushed to disk, leading to the recovery failure. The remote storage supports POSIX I/O, meaning the data should reach disk as long as PostgreSQL calls fdatasync (or a similar system call). Please note that the storage server itself has not been shut down.
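To rule out the possibility that the remote storage acknowledges fdatasync without actually persisting the data, I plan to run pg_test_fsync against a file on the same mount. A minimal sketch, where /remote/mnt is only a placeholder for the real mount point:

# probe fsync/fdatasync behaviour of the remote mount
# (/remote/mnt is a placeholder for the actual remote storage mount point)
pg_test_fsync --filename=/remote/mnt/fsync_probe --secs-per-test=5

As far as I understand, per-second figures that look too good for the underlying hardware would suggest the flushes are only reaching a volatile cache.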
Besides, I traced the system calls of a PostgreSQL checkpoint (triggered by "psql -c checkpoint"):

# strace -fvy -s 64 pg_ctl -D /tmp/data start 2>&1 | grep -E 'open|sync'
...
[pid 109705] openat(AT_FDCWD, "pg_xact/0000", O_RDWR|O_CREAT, 0600 <unfinished ...>
[pid 109705] <... openat resumed> ) = 3</tmp/data/pg_xact/0000>
[pid 109705] fsync(3</tmp/data/pg_xact/0000> <unfinished ...>
[pid 109705] <... fsync resumed> ) = 0
[pid 109705] openat(AT_FDCWD, "pg_xact", O_RDONLY) = 3</tmp/data/pg_xact>
[pid 109705] fsync(3</tmp/data/pg_xact>) = 0
[pid 109705] openat(AT_FDCWD, "pg_commit_ts", O_RDONLY) = 3</tmp/data/pg_commit_ts>
[pid 109705] fsync(3</tmp/data/pg_commit_ts>) = 0
[pid 109705] openat(AT_FDCWD, "pg_subtrans/0000", O_RDWR|O_CREAT, 0600) = 3</tmp/data/pg_subtrans/0000>
[pid 109705] openat(AT_FDCWD, "pg_multixact/offsets/0000", O_RDWR|O_CREAT, 0600) = 3</tmp/data/pg_multixact/offsets/0000>
[pid 109705] fsync(3</tmp/data/pg_multixact/offsets/0000>) = 0
[pid 109705] openat(AT_FDCWD, "pg_multixact/offsets", O_RDONLY) = 3</tmp/data/pg_multixact/offsets>
[pid 109705] fsync(3</tmp/data/pg_multixact/offsets>) = 0
[pid 109705] openat(AT_FDCWD, "pg_multixact/members", O_RDONLY) = 3</tmp/data/pg_multixact/members>
[pid 109705] fsync(3</tmp/data/pg_multixact/members>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical/snapshots", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3</tmp/data/pg_logical/snapshots>
[pid 109705] openat(AT_FDCWD, "pg_logical/mappings", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3</tmp/data/pg_logical/mappings>
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint.tmp", O_WRONLY|O_CREAT|O_EXCL, 0600) = 3</tmp/data/pg_logical/replorigin_checkpoint.tmp>
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint.tmp", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint.tmp>
[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint.tmp>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint>
[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical/replorigin_checkpoint", O_RDWR) = 3</tmp/data/pg_logical/replorigin_checkpoint>
[pid 109705] fsync(3</tmp/data/pg_logical/replorigin_checkpoint>) = 0
[pid 109705] openat(AT_FDCWD, "pg_logical", O_RDONLY) = 3</tmp/data/pg_logical>
[pid 109705] fsync(3</tmp/data/pg_logical>) = 0
[pid 109705] openat(AT_FDCWD, "pg_wal/00000001000000000000000B", O_RDWR <unfinished ...>
[pid 109705] <... openat resumed> ) = 3</tmp/data/pg_wal/00000001000000000000000B>
[pid 109705] fdatasync(3</tmp/data/pg_wal/00000001000000000000000B> <unfinished ...>    (let's say this is A)
[pid 109705] <... fdatasync resumed> ) = 0
[pid 109705] openat(AT_FDCWD, "/tmp/data/global/pg_control", O_RDWR <unfinished ...>
[pid 109705] <... openat resumed> ) = 4</tmp/data/global/pg_control>
[pid 109705] fsync(4</tmp/data/global/pg_control> <unfinished ...>    (let's say this is B)
[pid 109705] <... fsync resumed> ) = 0
[pid 109705] openat(AT_FDCWD, "pg_wal", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4</tmp/data/pg_wal>
[pid 109705] openat(AT_FDCWD, "pg_subtrans", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4</tmp/data/pg_subtrans>
[pid 109707] fdatasync(3</tmp/data/pg_wal/00000001000000000000000B>) = 0    (let's say this is C)
...

From the output, I can see that the data files, the pg_control file and the WAL are flushed to disk. I also noticed that the pg_control file is flushed between two WAL flushes (the calls marked A, B and C above).
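For what it's worth, next time I will try to isolate this ordering by tracing only the sync-related calls around a manual checkpoint, with timestamps. A rough sketch, where <checkpointer_pid> is a placeholder for the pid of the "checkpointer" process reported by ps:

# trace only open/fsync/fdatasync of the running checkpointer, with timestamps
strace -f -tt -e trace=openat,fsync,fdatasync -p <checkpointer_pid> -o /tmp/ckpt.trace &
psql -c checkpoint
sleep 2      # give the checkpointer time to finish flushing
kill %1      # detach strace so the trace file is complete
grep -E 'fsync|fdatasync' /tmp/ckpt.trace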
Here are my questions:

1. If a power failure occurred between B and C, would it be possible that the checkpoint was written to pg_control but not to the WAL, so that PostgreSQL recovery would then fail?
2. Is there any other reason that could cause this problem?
3. Is there any way to recover the data? I know pg_resetwal can reset the WAL so that PostgreSQL will start, but that would undoubtedly cause some data loss. (The exact commands I have in mind are sketched in the P.S. below.)

Thanks,
Xie Bin
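P.S. For question 3, in case it helps, this is the last-resort sequence I have in mind (the paths are placeholders, and I would only run the final command once everything else has been ruled out):

# take a cold file-level copy of the whole cluster before touching anything (placeholder paths)
cp -a /path/to/data /path/to/data.backup

# dry run: print the values pg_resetwal would write, without modifying anything
pg_resetwal -n -D /path/to/data

# last resort: force the reset, accepting the loss of any WAL that was never replayed
pg_resetwal -f -D /path/to/data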