So... I have a machine with an nvme drive manufactured by a certain
manufacturer who isn't known for the quality of their firmware
implementation.  I'm pretty sure that this is a result of the use of
fallocate(FALLOC_FL_ZERO_RANGE) to zero the log during format.  If I
format a device, mounting and repair both fail because the primary
superblock UUID doesn't match the log UUID:

[root@abacus654 ~]# btrace /dev/nvme0n1 > btrace.txt &
[root@abacus654 ~]# strace -s99 -o /tmp/a mkfs.xfs /dev/nvme0n1 -f
meta-data=/dev/nvme0n1           isize=512    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1542990848, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
[root@abacus654 ~]# fg
btrace /dev/nvme0n1 > btrace.txt^C
[root@abacus654 ~]# xfs_repair -n /dev/nvme0n1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
* ERROR: mismatched uuid in log
* SB : 1f0e74b7-c3af-4da1-9f15-3aa3605faed7
* log: 83a2835b-58f4-4e2d-bc3f-e063617363dd
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...

So then I looked at the log locations via xfs_db:

[root@abacus654 ~]# xfs_db -c 'sb' -c 'print logstart logblocks blocksize' -c 'convert fsb 805306374 daddr' /dev/nvme0n1
logstart = 805306374
logblocks = 521728
blocksize = 4096
[root@abacus654 ~]# xfs_db -c 'sb' -c 'convert fsb 805306374 daddr' /dev/nvme0n1
0x180000018 (6442450968)

Ok, so the log starts at (512b sector) 6442450968, which is byte offset
3298534895616 in /dev/nvme0n1.  Checking the strace log reveals:

[root@abacus654 ~]# grep 3298534895616 /tmp/a
fallocate(4, FALLOC_FL_ZERO_RANGE, 3298534895616, 2136997888) = 0
pwrite64(4, <buffer>, 1024, 3298534895616) = 1024

Which shows that first we instructed the device to zero the entire log,
and then we wrote the log head to the first 1024 bytes of the log.  If
we dump the log contents, we get:

[root@abacus654 ~]# dd if=/dev/nvme0n1 skip=805306371 count=521728 bs=4096 of=/tmp/badlog iflag=direct
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog
0000000 fe ed ba be 00 00 00 01 00 00 00 02 00 00 02 00
        376 355 272 276  \0  \0  \0 001  \0  \0  \0 002  \0  \0 002  \0
<snip>
0000304 1f 0e 74 b7 c3 af 4d a1 9f 15 3a a3 60 5f ae d7
        037 016   t 267 303 257   M 241 237 025   : 243   `   _ 256 327

Ok, so this is the 1024 bytes that we wrote out, and it matches the rest
of the filesystem.  Note the 1f 0e 74 b7 c3 af... sequence near byte
304, which is the fs uuid according to xlog_rec_header.  This looks good
so far.  But then we get to the next 1024 bytes:

0001024 fe ed ba be 00 00 00 01 00 00 00 02 00 00 7e 00
        376 355 272 276  \0  \0  \0 001  \0  \0  \0 002  \0  \0   ~  \0
<snip>
0001328 83 a2 83 5b 58 f4 4e 2d bc 3f e0 63 61 73 63 dd
        203 242 203   [   X 364   N   - 274   ? 340   c   a   s   c 335

Notice the UUID here doesn't match -- it's the 83 a2 83 5b... sequence
that repair complained about above.  Clearly, the log zeroing didn't
work, and mkfs didn't even try a trivial check that the accelerated
zeroing worked properly.
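(In case anyone wants to check my math on that xfs_db conversion, here
it is spelled out as a quick C sketch.  The agblklog value of 28 is my
own back-of-the-envelope guess, i.e. log2 of agsize=268435455 rounded
up; real code would read sb_agblklog straight out of the superblock.)

/* Sketch of the fsb -> daddr -> byte offset conversion for this geometry. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t logstart  = 805306374ULL;	/* sb logstart, in fs blocks */
	uint64_t agsize    = 268435455ULL;	/* blocks per AG */
	unsigned agblklog  = 28;		/* assumed: log2(agsize), rounded up */
	unsigned blocksize = 4096;

	/* a segmented fsblock number is (agno << agblklog) | agbno */
	uint64_t agno  = logstart >> agblklog;
	uint64_t agbno = logstart & ((1ULL << agblklog) - 1);

	/* linearize to 512-byte sectors, then to a byte offset */
	uint64_t daddr = (agno * agsize + agbno) * (blocksize >> 9);

	printf("daddr %llu byteoff %llu\n",
	       (unsigned long long)daddr,
	       (unsigned long long)(daddr << 9));
	return 0;
}

That spits out daddr 6442450968 and byteoff 3298534895616, and the same
location expressed in 4096-byte blocks is the skip=805306371 in the dd
invocations.  Anyway, back to the broken zeroing.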
Indeed, I can even re-zero the entire log from the command line:

[root@abacus654 ~]# btrace /dev/nvme0n1 > btrace2.txt &
[root@abacus654 ~]# strace -s99 -e fallocate xfs_io -c 'fzero 3298534895616 2136997888' /dev/nvme0n1
fallocate(3, FALLOC_FL_ZERO_RANGE, 3298534895616, 2136997888) = 0
[root@abacus654 ~]# fg
btrace /dev/nvme0n1 > btrace.txt^C

And then re-dump the log:

[root@abacus654 ~]# dd if=/dev/nvme0n1 skip=805306371 count=521728 bs=4096 of=/tmp/badlog2 iflag=direct
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog2 | less
0000000 fe ed ba be 00 00 00 01 00 00 00 02 00 00 02 00
        376 355 272 276  \0  \0  \0 001  \0  \0  \0 002  \0  \0 002  \0
0000016 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00
         \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0
0000032 00 00 00 00 ff ff ff ff 00 00 00 01 b0 c0 d0 d0
         \0  \0  \0  \0 377 377 377 377  \0  \0  \0 001 260 300 320 320
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000288 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 001
0000304 1f 0e 74 b7 c3 af 4d a1 9f 15 3a a3 60 5f ae d7
        037 016   t 267 303 257   M 241 237 025   : 243   `   _ 256 327

Which should have zeroed the log completely.  Looking at the block
trace from the mkfs run, I see:

[root@abacus654 ~]# grep 6442450968 btrace.txt
259,0    4     5907     0.118525480 26004  Q  WS 6442450968 + 4173824 [mkfs.xfs]
259,0    4     5908     0.118541800     0  C  WS 6442450968 + 4173824 [0]
259,0    4     5909     0.118572240 26004  Q  WS 6442450968 + 2 [mkfs.xfs]
259,0    4     5910     0.118588160     0  C  WS 6442450968 + 2 [0]

So we did in fact issue a write to the device, but we can still read
the previous contents after the write completes!  Hooray, the device
firmware is broken!  If I look at the block trace from the xfs_io
zero-range invocation, I see:

[root@abacus654 ~]# grep 6442450968 btrace2.txt
259,0    6        1 1266874889.701085336 26385  Q  WS 6442450968 + 4173824 [xfs_io]
259,0    6        2 1266874889.701172696     0  C  WS 6442450968 + 4173824 [0]

So the kernel indeed sent the write request, but a subsequent re-read of
the log contents still shows the old log contents, which is why repair
and mount get mad.  Note that a standard overwrite of the log causes a
subsequent re-read to produce zeroes:

[root@abacus654 ~]# dd if=/dev/zero of=/dev/nvme0n1 seek=805306371 count=521728 bs=4096 oflag=direct,sync
521728+0 records in
521728+0 records out
2136997888 bytes (2.1 GB, 2.0 GiB) copied, 9.75987 s, 219 MB/s
[root@abacus654 ~]# dd if=/dev/nvme0n1 skip=805306371 count=521728 bs=4096 of=/tmp/badlog3 iflag=direct
521728+0 records in
521728+0 records out
2136997888 bytes (2.1 GB, 2.0 GiB) copied, 8.55764 s, 250 MB/s
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog3 | head -n15
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
2136997888

And the format works this time too:

[root@abacus654 ~]# strace -s99 -o /tmp/a mkfs.xfs /dev/nvme0n1 -f
meta-data=/dev/nvme0n1           isize=512    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1542990848, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
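For reference, the C equivalent of that dd overwrite -- i.e. what
zeroing the log without any help from the device boils down to -- is
just a loop pushing a zeroed, sector-aligned buffer through O_DIRECT.
This is only a rough sketch, with the device path and log offsets from
this machine hardcoded into it:

/* Zero the log range the slow way: O_DIRECT pwrites of a zeroed buffer. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char	*dev = "/dev/nvme0n1";	/* device under test */
	off_t		off = 3298534895616LL;	/* log start, in bytes */
	off_t		len = 2136997888LL;	/* log length, in bytes */
	size_t		bufsz = 1U << 20;	/* write in 1MiB chunks */
	char		*zbuf;
	int		fd;

	fd = open(dev, O_WRONLY | O_DIRECT | O_SYNC);
	if (fd < 0) {
		perror(dev);
		return 1;
	}

	/* O_DIRECT wants a sector-aligned buffer, offset, and length */
	if (posix_memalign((void **)&zbuf, 4096, bufsz))
		return 1;
	memset(zbuf, 0, bufsz);

	for (off_t pos = off; pos < off + len; pos += bufsz) {
		size_t n = (size_t)(off + len - pos) < bufsz ?
			   (size_t)(off + len - pos) : bufsz;

		if (pwrite(fd, zbuf, n, pos) != (ssize_t)n) {
			perror("pwrite");
			return 1;
		}
	}
	free(zbuf);
	return close(fd) ? 1 : 0;
}

The alignment requirements are satisfied here because the internal log
starts on a filesystem block boundary and is sized in filesystem blocks.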
[root@abacus654 ~]# xfs_repair -n /dev/nvme0n1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...

In conclusion, the drive firmware is broken.

Question: Should we be doing /some/ kind of re-read after zeroing the
log to detect these sh*tty firmwares and fall back to a pwrite()?

--D
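P.S. To make "/some/ kind of re-read" a little more concrete, the check
I'm imagining looks something like the function below.  This is a
hand-waved sketch, not the actual mkfs/libxfs code paths, and
log_zero_range_stuck() and SAMPLE_LEN are names I just made up: issue
the ZERO_RANGE, read a small sample of the range back through the same
O_DIRECT fd so we see the media rather than the page cache, and report
whether any stale bytes survived so that the caller can fall back to a
plain pwrite() loop of zeroed buffers (like the dd-equivalent sketch
earlier in this mail).

/*
 * Hypothetical helper, not a libxfs function: zero a range and check
 * whether the device actually did it.  Returns 1 if stale data
 * survived, 0 if the sample reads back as zeroes, -1 on error (in
 * which case the caller should just take the pwrite path anyway).
 * Assumes fd was opened O_DIRECT and that off/len are block aligned,
 * which the internal log always is.
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <unistd.h>

#define SAMPLE_LEN	(64 * 1024)	/* arbitrary spot-check size */

int log_zero_range_stuck(int fd, off_t off, off_t len)
{
	size_t	sample = len < SAMPLE_LEN ? (size_t)len : SAMPLE_LEN;
	char	*buf;
	int	stale = 0;

	/* fast path: ask the device to zero the range for us */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) < 0)
		return -1;

	/* read back the start of the range and look for survivors */
	if (posix_memalign((void **)&buf, 4096, sample))
		return -1;
	if (pread(fd, buf, sample, off) != (ssize_t)sample) {
		free(buf);
		return -1;
	}
	for (size_t i = 0; i < sample; i++) {
		if (buf[i]) {
			stale = 1;
			break;
		}
	}
	free(buf);
	return stale;
}

A single spot-check at the start of the range happens to be enough for
this drive (the stale second log record sits at byte 1024), but a
nastier device could leave old data anywhere, so maybe it wants to
sample a handful of offsets instead.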