Hi Dave,
On 3/13/22 23:46, Dave Chinner wrote:
On Sun, Mar 13, 2022 at 04:47:19PM +0100, Manfred Spraul wrote:
Hello together,
after a simulated power failure, I have observed:
Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs],
xfs_dir3_block block 0x86f58
[14768.047531] XFS (loop0): Unmount and run xfs_repair
[14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX
For future reference, please paste the entire log message, from
the time that the fs was mounted to the end of the hexdump output.
You might not think the hexdump output is important, but as you'll
see later....
Noted. I had to choose what to include in the mail; the full log would have been too much information.
<<<
Is this a known issue?
Is what a known issue? All this is XFS finding a corrupt metadata
block because a CRC is invalid, which is exactly what it's supposed
to do.
As it is, CRC errors are indicative of storage problems such as bit
errors and torn writes, because what has been read from disk does
not match what XFS wrote when it calculated the CRC.
The image file is here: https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz
As first question:
Are 512 byte sectors supported, or does xfs assume that 4096 byte writes are
atomic?
512 byte *IO* is supported on devices that have 512 byte sector
support, but there are other rules that XFS sets for metadata. e.g.
that metadata writes are expected to be written completely or
replayed completely as a whole unit regardless of their length.
This is bookended by the use of cache flushes and FUAs to ensure that
multi-sector writes are wholly completed before the recovery
information is tossed away.
[...]
How were the power failures simulated:
I added support to nbd to log all write operations, including the written
data. This was merged in nbd-3.24.
I've used that to create a log of running dbench (+ a few tar/rm/manual
tests) on a 500 MB image file.
In total, 2.9 million 512-byte sector writes. The datalog is ~1.5 GB long.
If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802
blocks, the above listed error message is shown.
After 1,821,799 or 1,821,803 sectors, everything is ok.
(Correcting my own typo:)
1,821,798 or 1,821,803 are ok.
(block numbers are 0-based)
H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE)
O=0000000010deb000 L=00001000
block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000), len 512 (0x200).
block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200), len 512 (0x200).
block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400), len 512 (0x200). << OK
block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600), len 512 (0x200). FAIL
block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800), len 512 (0x200). FAIL
block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00), len 512 (0x200). FAIL
block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00), len 512 (0x200). FAIL
block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00), len 512 (0x200). << OK
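For reference, the arithmetic behind the listing above can be sketched in a few lines of Python. This is only an illustration and not the actual nbd replay code: the helper name split_write is hypothetical, and the 512-byte sector size and the 0-based block counter are taken from the log above.

```python
SECTOR_SIZE = 512  # sector size used by the test, per the log above


def split_write(offset, length, first_block):
    """Split one logged write into 512-byte sector writes.

    Returns a list of (block_number, byte_offset, sector_number) tuples.
    `first_block` is the 0-based count of sector writes replayed so far.
    (Hypothetical helper, not part of nbd.)
    """
    if offset % SECTOR_SIZE or length % SECTOR_SIZE:
        raise ValueError("write is not sector aligned")
    out = []
    for i in range(length // SECTOR_SIZE):
        byte_off = offset + i * SECTOR_SIZE
        out.append((first_block + i, byte_off, byte_off // SECTOR_SIZE))
    return out


# Values from the NBD request above: O=0000000010deb000 L=00001000
sectors = split_write(0x10DEB000, 0x1000, 1821795)
for block, byte_off, sector in sectors:
    print(f"block {block} ({block:#x}): offset {byte_off} ({byte_off:#x}), "
          f"sector {sector:#x}")
```

The single 4096-byte write becomes eight sector writes, blocks 1821795 to 1821802. The first sector number, 0x86f58, is the same value that appears in the error message and in the hexdump, assuming both are 512-byte sector addresses.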
OK, this test is explicitly tearing writes at the storage level.
When there is an update to multiple sectors of the metadata block,
the metadata will be inconsistent on disk while those individual
sector writes are replayed.
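A minimal sketch of why a torn multi-sector write shows up as a CRC error, under stated assumptions: zlib.crc32 stands in for the CRC32c that XFS uses, the "CRC in the first four bytes" layout is invented for the illustration and is not the XFS on-disk format, and the helper names are hypothetical.

```python
import zlib

SECTOR_SIZE = 512
BLOCK_SIZE = 4096


def with_crc(payload):
    """Prefix payload with a 4-byte CRC (illustrative layout, not XFS's)."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def crc_ok(block):
    """Check the stored CRC against the CRC of the rest of the block."""
    return int.from_bytes(block[:4], "little") == zlib.crc32(block[4:])


def torn_write(old, new, sectors_written):
    """Simulate a write where only the first `sectors_written` sectors land."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


old = with_crc(b"A" * (BLOCK_SIZE - 4))  # block contents before the update
new = with_crc(b"B" * (BLOCK_SIZE - 4))  # block contents after the update

# One result per possible tear point: 0 sectors written up to all 8 written.
results = [crc_ok(torn_write(old, new, n))
           for n in range(BLOCK_SIZE // SECTOR_SIZE + 1)]
print(results)
```

Only the two end states (nothing written, everything written) verify. Every intermediate state mixes the new CRC with partly old contents and fails the check, which is the kind of state the test produces when it stops the replay in the middle of a 4096-byte write.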
Thanks for the clarification.
I'll modify the test application to never tear write operations and retry.
If there are any findings, I'll share them.
--
Manfred