Re: Question about the "EXPERIMENTAL" tag for dax in XFS

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 2 Mar 2021 16:38:28 +1100

On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device
> > gone", we need a generic block device interface that tells us "range
> > of block device is gone".
> 
> This is the crux of the disagreement. The block_device is going away
> *and* the dax_device is going away.

No, that is not the disagreement I have with what you are saying.
You still haven't understand that it's even more basic and generic
than devices going away. At the simplest form, all the filesystem
wants is to be notified of is when *unrecoverable media errors*
occur in the persistent storage that underlies the filesystem.

The filesystem does not care what that media is build from - PMEM,
flash, corroded spinning disks, MRAM, or any other persistent media
you can think off. It just doesn't matter.

What we care about is that the contents of a *specific LBA range* no
longer contain *valid data*. IOWs, the data in that range of the
block device has been lost, cannot be retreived and/or cannot be
written to any more.

PMEM taking a MCE because ECC tripped is a media error because data
is lost and inaccessible until recovery actions are taken.

MD RAID failing a scrub is a media error and data is lost and
unrecoverable at that layer.

A device disappearing is a media error because the storage media is
now permanently inaccessible to the higher layers.

This "media error" categorisation is a fundamental property of
persistent storage and, as such, is a property of the block devices
used to access said persistent storage.

That's the disagreement here - that you and Christoph are saying
->corrupted_range is not a block device property because only a
pmem/DAX device currently generates it.

You both seem to be NACKing a generic interface because it's only
implemented for the first subsystem that needs it. AFAICT, you
either don't understand or are completely ignoring the architectural
need for it to be provided across the rest of the storage stack that
*block device based filesystems depend on*.

Sure, there might be dax device based fielsystems around the corner.
They just require a different pmem device ->corrupted_range callout
to implement the notification - one that directs to the dax device
rather than the block device. That's simple and trivial to
implement, but such functionaity for DAX devices  does not replace
the need for the same generic functionality to be provided across a
*range of different block devices* as required by *block device
based filesystems*.

And that's fundamentally the problem. XFS is block device based, not
DAX device based. We require errors to be reported through block
device mechanisms. fs-dax does not change this - it is based on pmem
being presented as a primarily as a block device to the block device
based filesystems and only secondarily as a dax device. Hence if it
can be trivially implemented as a block device interface, that's
where it should go, because then all the other block devices that
the filesytem runs on can provide the same functionality for similar
media error events....

> The dax_device removal implies one
> set of actions (direct accessed pfns invalid) the block device removal
> implies another (block layer sector access offline).

There you go again, saying DAX requires an action, while the block
device notification is a -state change- (i.e. goes offline).

This is exactly what I said was wrong in my last email.

> corrupted_range
> is blurring the notification for 2 different failure domains. Look at
> the nascent idea to mount a filesystem on dax sans a block device.
> Look at the existing plumbing for DM to map dax_operations through a
> device stack.

Ummm, it just maps the direct_access call to the underlying device
and calls it's ->direct_access method. All it's doing is LBA
mapping. That's all it needs to do for ->corrupted_range, too.
I have no clue why you think this is a problem for error
notification...

> Look at the pushback Ruan got for adding a new
> block_device operation for corrupted_range().

one person said "no". That's hardly pushback. Especially as I think
Christoph's objection about this being dax specific functionality
is simply wrong, as per above.

> > This is why we need to communicate what error occurred, not what
> > action a device driver thinks needs to be taken.
> 
> The driver is only an event producer in this model, whatever the
> consumer does at the other end is not its concern. There may be a
> generic consumer and a filesystem specific consumer.

<sigh>

That's why these are all ops functions that can provide multiple
implementations to different device types. So that when we get a new
use case, the ops function structure can be replaced with one that
directs the notification to the new user instead of to the existing
one. It's a design pattern we use all over the kernel code.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx