Re: [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag

Lukas Straub <lukasstraub2@xxxxxx> · Sat, 6 Nov 2021 07:41:46 +0000

On Tue, 2 Nov 2021 09:03:55 -0700
Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:  
> > > Thanks - I try to be honest.  As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a built-in hardware component and designs to include
> > > dealing with such errors.  
> >
> > No, I don't think so.  Bit errors do happen in all media, which is
> > why devices are built to handle them.  It is just the Intel-style
> > pmem interface to handle them which is completely broken.  
> 
> No, any media can report checksum / parity errors. NVME also seems to
> do a poor job with multi-bit ECC errors consumed from DRAM. There is
> nothing "pmem" or "Intel" specific here.
> 
> > > errors in mind from start.  I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?  
> >
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path.  Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.  
> 
> I would expect this interface to be useful outside of pmem as a
> "failfast" or "try harder to recover" flag for reading over media
> errors.

Yeah, I think this flag could also be useful for non-raid btrfs.

If you have an extent that is shared between multiple snapshots and
its data is corrupted (without the disk returning an i/o error), btrfs
won't be able to fix the corruption without raid and will always return
an i/o error when accessing the affected range (due to checksum
mismatch).

Of course you could just overwrite the range in the file with good
data, but that would only fix the file you are operating on, snapshots
will still reference the corrupted data.

With this flag, a read could just return the corrupted data without an
i/o error, and a write could write directly to the on-disk data to fix
the corruption everywhere. btrfs could also check that the newly written
data actually matches the stored checksum.
However, in this btrfs use case the process still needs to be
CAP_SYS_ADMIN or similar, since it's easy to create collisions for
crc32, and so an attacker could write to a file that he has no
permissions for, if that file shares an extent with one where he has
write permissions.
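
To illustrate why a checksum match alone is not enough of a gate, here
is a rough sketch of how cheaply a CRC collision can be forged. It is
not btrfs code: it uses zlib's CRC-32 for convenience, whereas btrfs
defaults to crc32c (a different polynomial with the same linear
structure). The 4096-byte block size and the names forge_suffix and
accept_repair are assumptions made up for this example.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial used by zlib (not crc32c)


def _make_table():
    """Build the standard 256-entry reflected CRC-32 lookup table."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        table.append(c)
    return table


TABLE = _make_table()
# The top byte of every table entry is unique, so it identifies the index.
TOP_TO_INDEX = {entry >> 24: i for i, entry in enumerate(TABLE)}


def forge_suffix(prefix: bytes, target_crc: int) -> bytes:
    """Return 4 bytes such that crc32(prefix + suffix) == target_crc."""
    # Run the CRC register backwards over four byte steps from the target.
    state = target_crc ^ 0xFFFFFFFF
    for _ in range(4):
        i = TOP_TO_INDEX[state >> 24]
        state = ((state ^ TABLE[i]) << 8) | i
    # XOR with the register value after the prefix to get the needed bytes.
    start = zlib.crc32(prefix) ^ 0xFFFFFFFF
    return (state ^ start).to_bytes(4, "little")


def accept_repair(stored_crc: int, new_block: bytes) -> bool:
    """Hypothetical gate: accept a repair write only if the checksum matches."""
    return zlib.crc32(new_block) == stored_crc


BLOCK_SIZE = 4096  # assumed block size, for illustration only

# Checksum recorded for the original (now corrupted) block.
original = bytes(range(256)) * 16
stored_crc = zlib.crc32(original)

# Attacker picks arbitrary content and fixes up only the last 4 bytes.
payload = b"attacker-controlled".ljust(BLOCK_SIZE - 4, b"\0")
forged = payload + forge_suffix(payload, stored_crc)
```

Here accept_repair(stored_crc, forged) holds even though forged has
nothing in common with the original block, which is why the checksum
comparison cannot replace a privilege check.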

Regards,
Lukas Straub
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel