Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

Dan Williams <dan.j.williams@xxxxxxxxx> · Mon, 2 May 2016 16:25:51 -0700

On Mon, May 2, 2016 at 4:04 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
>> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>>
>> > On Mon, Apr 25, 2016 at 11:53:13PM +0000, Verma, Vishal L wrote:
>> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
>> > You're assuming that only the DAX aware application accesses it's
>> > files.  users, backup programs, data replicators, fileystem
>> > re-organisers (e.g.  defragmenters) etc all may access the files and
>> > they may throw errors. What then?
>>
>> I'm not sure how this is any different from regular storage.  If an
>> application gets EIO, it's up to the app to decide what to do with that.
>
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the the app that owns the data won't ever get a
> chance to correct the error.
>
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can yo imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong....

Since when does grep write to a file on error?

>
>> >> > Where does the application find the data that was lost to be able to
>> >> > rewrite it?
>> >>
>> >> The data that was lost is gone -- this assumes the application has some
>> >> ability to recover using a journal/log or other redundancy - yes, at the
>> >> application layer. If it doesn't have this sort of capability, the only
>> >> option is to restore files from a backup/mirror.
>> >
>> > So the architecture has a built in assumption that only userspace
>> > can handle data loss?
>>
>> Remember that the proposed programming model completely bypasses the
>> kernel, so yes, it is expected that user-space will have to deal with
>> the problem.
>
> No, it doesn't completely bypass the kernel - the kernel is the
> infrastructure that catches the errors in the first place, and it
> owns and controls all the metadata that corresponds to the physical
> location of that error. The only thing the kernel doesn't own is the
> *contents* of that location.
>
>> > What about filesytsems like NOVA, that use log structured design to
>> > provide DAX w/ update atomicity and can potentially also provide
>> > redundancy/repair through the same mechanisms? Won't pmem native
>> > filesystems with built in data protection features like this remove
>> > the need for adding all this to userspace applications?
>>
>> I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
>> with this for existing file systems, right?
>
> Yes, but that misses my point that it seems that the design is only
> focussed on userspace and existing filesystems and there is no
> consideration of kernel side functionality that could do transparent
> recovery....
>
>> > If so, shouldn't that be the focus of development rahter than
>> > placing the burden on userspace apps to handle storage repair
>> > situations?
>>
>> It really depends on the programming model.  In the model Vishal is
>> talking about, either the applications themselves or the libraries they
>> link to are expected to implement the redundancies where necessary.
>
> IOWs, filesystems no longer have any control over data integrity.
> Yet it's the filesystem developers who will still be responsible for
> data integrity and when the filesystem has a data coruption event
> we'll get blamed and the filesystem gets a bad name, even though
> it's entirely the applications fault. We've seen this time and time
> again - application developers cannot be trusted to guarantee data
> integrity. yes, some apps will be fine, but do you really expect
> application devs that refuse to use fsync because it's too slow are
> going to have a different approach to integrity when it comes to
> DAX?

Yes, completely agree.  The applications that will implement competent
error recovery with these mechanisms will be vanishingly small, and
there is definite room for a kernel data-redundancy solution that
builds on these patches.

>
>> >> > There's an implicit assumption that applications will keep redundant
>> >> > copies of their data at the /application layer/ and be able to
>> >> > automatically repair it?
>>
>> That's one way to do things.  It really depends on the application what
>> it will do for recovery.
>>
>> >> > And then there's the implicit assumption that it will unlink and
>> >> > free the entire file before writing a new copy
>>
>> I think Vishal was referring to restoring from backup.  cp itself will
>> truncate the file before overwriting, iirc.
>
> Which version of cp? what happens if they use --sparse and the error
> is in a zeroed region? There's so many assumptions about undefined userspace
> environment, application and user behaviour being made here, and
> it's all being handwaved away.
>
> I'm asking for this to be defined, demonstrated and documented as a
> working model that cannot be abused and doesn't have holes the size
> of trucks in it, not handwaving...

You lost me...  how are these patches abusing the existing semantics
of -EIO and write to clear?

>> >> To summarize, the two cases we want to handle are:
>> >> 1. Application has inbuilt recovery:
>> >>   - hits badblock
>> >>   - figures out it is able to recover the data
>> >>   - handles SIGBUS or EIO
>> >>   - does a (sector aligned) write() to restore the data
>> >
>> > The "figures out" step here is where >95% of the work we'd have to
>> > do is. And that's in filesystem and block layer code, not
>> > userspace, and userspace can't do that work in a signal handler.
>> > And it  can still fall down to the second case when the application
>> > doesn't have another copy of the data somewhere.
>>
>> I read that "figures out" step as the application determining whether or
>> not it had a redundant copy.
>
> Another undocumented assumption, that doesn't simplify what needs to
> be done. Indeed, userspace can't do that until it is in SIGBUS
> context, which tends to imply applications need to do a major amount
> of work from within the signal handler....
>
>> > FWIW, we don't have a DAX enabled filesystem that can do
>> > reverse block mapping, so we're a year or two away from this being a
>> > workable production solution from the filesystem perspective. And
>> > AFAICT, it's not even on the roadmap for dm/md layers.
>>
>> Do we even need that?  What if we added an FIEMAP flag for determining
>> bad blocks.
>
> So you're assuming that the filesystem has been informed of the bad
> blocks and has already marked the bad regions of the file in it's
> extent list?
>
> How does that happen? What mechanism is used for the underlying
> block device to inform the filesytem that theirs a bad LBA, and how
> does the filesytem the map that to a path/file/offset with reverse
> mapping? Or is there some other magic that hasn't been explained
> happening here?

In 4.5 we added this:

commit 99e6608c9e7414ae4f2168df8bf8fae3eb49e41f
Author: Vishal Verma <vishal.l.verma@xxxxxxxxx>
Date:   Sat Jan 9 08:36:51 2016 -0800

    block: Add badblock management for gendisks

    NVDIMM devices, which can behave more like DRAM rather than block
    devices, may develop bad cache lines, or 'poison'. A block device
    exposed by the pmem driver can then consume poison via a read (or
    write), and cause a machine check. On platforms without machine
    check recovery features, this would mean a crash.

    The block device maintaining a runtime list of all known sectors that
    have poison can directly avoid this, and also provide a path forward
    to enable proper handling/recovery for DAX faults on such a device.

    Use the new badblock management interfaces to add a badblocks list to
    gendisks.

    Signed-off-by: Vishal Verma <vishal.l.verma@xxxxxxxxx>
    Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html