Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 17 Mar 2022 13:47:05 +1100

On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote:
> Hi Dave,
> 
> On 3/14/22 16:18, Manfred Spraul wrote:
> > Hi Dave,
> > 
> > On 3/13/22 23:46, Dave Chinner wrote:
> > > OK, this test is explicitly tearing writes at the storage level.
> > > When there is an update to multiple sectors of the metadata block,
> > > the metadata will be inconsistent on disk while those individual
> > > sector writes are replayed.
> > 
> > Thanks for the clarification.
> > 
> > I'll modify the test application to never tear write operations and
> > retry.
> > 
> > If there are findings, then I'll distribute them.
> > 
> I've modified the test app, and with 4000 simulated power failures I have
> not seen any corruptions.
> 
> 
> Thus:
> 
> - With teared write operations: 2 corruptions from ~800 simulated power
> failures
> 
> - Without teared write operations: no corruptions from ~4000 simulated power
> failures.

Good to hear.

> But:
> 
> I've checked the eMMC specification, and the spec allows that teared write
> happen:

Yes, most storage only guarantees that sector writes are atomic and
so multi-sector writes have no guarantees of being written
atomically.  IOWs, all storage technologies that currently exist are
allowed to tear multi-sector writes.

However, FUA writes are guaranteed to be whole on persistent storage
regardless of size when the hardware signals completion. And any
write that the hardware has signalled as complete before a cache
flush is received is also guaranteed to be whole on persistent
storage when the cache flush is signalled as complete by the
hardware. These mechanisms provide protection against torn writes.

IOWs, it's up to filesystems to guarantee data is on stable storage
before they trust it fully. Filesystems are pretty good at using
REQ_FLUSH, REQ_FUA and write completion ordering to ensure that
anything they need whole and complete on stable storage is actually
whole and complete.
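
A rough userspace analogy of that ordering, as a sketch only: this is
not XFS code, and the names durable_append() and commit() and the
"COMMIT" marker format are made up for illustration. It assumes
os.fsync() does not return until the kernel reports the data as stable,
which on Linux normally involves a device cache flush. The commit
marker is only issued after the payload's fsync has completed, so a
marker on disk implies the payload before it is whole.

```python
import os
import zlib


def durable_append(fd: int, record: bytes) -> None:
    """Append record, then wait until the kernel reports it as stable."""
    view = memoryview(record)
    while len(view) > 0:
        written = os.write(fd, view)   # os.write may write fewer bytes than asked
        view = view[written:]
    os.fsync(fd)                       # completion ordering point


def commit(path: str, payload: bytes) -> None:
    """Hypothetical two-step commit: payload first, marker second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        durable_append(fd, payload)    # step 1: payload is made durable
        marker = b"COMMIT" + zlib.crc32(payload).to_bytes(4, "little")
        durable_append(fd, marker)     # step 2: issued only after step 1 completed
    finally:
        os.close(fd)
```

If power fails between the two steps, a reader finds a payload with no
marker and knows not to trust it.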

In the cases where torn writes occur because they haven't been
covered by a FUA or cache flush guarantee (such as your test),
filesystems need mechanisms in their metadata to detect such events.
CRCs are the prime mechanism for this - that's what XFS uses, and it
was XFS reporting a CRC failure when reading torn metadata that
started this whole thread.
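
To make the detection concrete, here is a minimal sketch under stated
assumptions: a 4096-byte block made of 512-byte sectors, with a 4-byte
CRC stored at the start of the block. The layout and the helper names
are hypothetical, and zlib.crc32 (plain CRC-32) stands in for the
CRC32c that XFS actually uses. The update touches the first and last
sectors; if only the first sector reaches the media, the stored CRC no
longer matches the contents.

```python
import zlib

SECTOR = 512    # assumed atomic write unit
BLOCK = 4096    # assumed metadata block size (8 sectors)


def seal(payload: bytes) -> bytes:
    """Build a block: 4-byte CRC of the payload, then the payload."""
    assert len(payload) == BLOCK - 4
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def verify(block: bytes) -> bool:
    """Check the stored CRC against the payload actually read back."""
    stored = int.from_bytes(block[:4], "little")
    return stored == zlib.crc32(block[4:])


def tear(old: bytes, new: bytes, sectors_persisted: int) -> bytes:
    """Simulate power loss after only the first N sectors of `new` landed."""
    cut = sectors_persisted * SECTOR
    return new[:cut] + old[cut:]


old_payload = bytes(BLOCK - 4)
new_payload = bytearray(old_payload)
new_payload[10] = 0x01      # change in the first sector
new_payload[4000] = 0x02    # change in the last sector

old_block = seal(old_payload)
new_block = seal(bytes(new_payload))
torn_block = tear(old_block, new_block, sectors_persisted=1)
```

Here verify(old_block) and verify(new_block) both succeed, while
verify(torn_block) fails: the block carries the new CRC but still has
the old last sector.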

> Is my understanding correct that XFS support neither eMMC nor NVM devices?
> (unless there is a battery backup that exceeds the guarantees from the spec)

Incorrect.

They are supported just fine because flush/FUA semantics provide
guarantees against torn writes in normal operation. IOWs, torn
writes are something that almost *never* happens in real life, even
when power fails suddenly. Despite this, XFS can detect it has
occurred (because broken storage is all too common!), and if it
can't recover automatically, it will shut down and ask the user to
correct the problem.

BTRFS and ZFS can also detect torn writes, and if you use the
(non-default) ext4 option "metadata_csum" it will also detect torn
writes to metadata via CRC failures. There are other filesystems
that can detect and correct torn writes, too.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx