On Sun, May 7, 2023 at 1:21 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote: > Thomas Munro <thomas.munro@xxxxxxxxx> writes: > > Did you previously run this same workload on versions < 15 and never > > see any problem? 15 gained a new feature CREATE DATABASE ... > > STRATEGY=WAL_LOG, which is also the default. I wonder if there is a > > bug somewhere near that, though I have no specific idea. > > Per the release notes I was just writing ... > > <listitem> > <!-- > Author: Michael Paquier <michael@xxxxxxxxxxx> > Branch: master [8a8661828] 2023-02-22 10:14:52 +0900 > Branch: REL_15_STABLE [fa5dd460c] 2023-02-22 10:14:56 +0900 > --> > <para> > Fix potential corruption of the template (source) database after > <command>CREATE DATABASE</command> with the <literal>STRATEGY > WAL_LOG</literal> option (Nathan Bossart, Ryo Matsumura) > </para> Hmm. That bug seems to have caused corruption (backwards time travel) of blocks in the *source* DB's pg_class, by failing to write back changes. We seem to have zeroed pages in the *target* database, for all catalogs (apparently everything copied by RelationCopyStorageUsingBuffer()), even though the template is still fine. It is as if RelationCopyStorageUsingBuffer() created the zero-filed file with smgrextend(), but then the buffer data was never written out even though we memcpy'd it into the a buffer and set the buffer dirty. Bug-in-PostgreSQL explanations could include that we forgot it was dirty, or some backend wrote it out to the wrong file; but if we were forgetting something like permanent or dirty, would there be a more systematic failure? Oh, it could require special rare timing if it is similar to 8a8661828's confusion about permanence level or otherwise somehow not setting BM_PERMANENT, but in the target blocks, so I think that'd require a checkpoint AND a crash. It doesn't reproduce for me, but perhaps more unlucky ingredients are needed. Bug-in-OS/FS explanations could include that a whole lot of writes were mysteriously lost in some time window, so all those files still contain the zeroes we write first in smgrextend(). I guess this previously rare (previously limited to hash indexes?) use of sparse file hole-punching could be a factor in an it's-all-ZFS's-fault explanation: openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600) = 36 (0x24) openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25) lseek(37,0x0,SEEK_END) = 32768 (0x8000) lseek(37,0x0,SEEK_END) = 32768 (0x8000) pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192 (0x2000) <-- smgrextend(final block) lseek(36,0x0,SEEK_END) = 32768 (0x8000) I was trying to think about how I might go about trying to repro the exact system setup. Evgeny, do you mind sharing your "zfs get all /path/to/pgdata" (curious to see block size, compression settings, anything else etc) and your postgresql.conf? And your exact Ubuntu kernel version and ZFS package versions?