On Fri, Jan 31, 2020 at 03:31:58PM -0800, Dan Williams wrote:
> On Thu, Jan 23, 2020 at 11:07 AM Darrick J. Wong
> <darrick.wong@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is an RFC patch to provide a dax operation to zero a range of memory.
> > > It will also clear poison in the process. This is primarily a compile-tested
> > > patch. I don't have real hardware to test the poison logic. I am posting
> > > this to figure out if this is the right direction or not.
> > >
> > > Motivation for this patch comes from Christoph's feedback that he would
> > > prefer a dax way to zero a range instead of relying on having to
> > > call blkdev_issue_zeroout() in __dax_zero_page_range().
> > >
> > > https://lkml.org/lkml/2019/8/26/361
> > >
> > > My motivation for this change is virtiofs DAX support. There we use DAX
> > > but we don't have a block device. So any dax code which has the assumption
> > > that there is always a block device associated is a problem. So this
> > > is more of a cleanup of one of the places where dax has this dependency
> > > on block device and if we add a dax operation for zeroing a range, it
> > > can help with not having to call blkdev_issue_zeroout() in dax path.
> > >
> > > I have yet to take care of stacked block drivers (dm/md).
> > >
> > > Current poison clearing logic is primarily written with the assumption that
> > > I/O is sector aligned. With this new method, this assumption is broken
> > > and one can pass any range of memory to zero. I have fixed a few places
> > > in existing logic to be able to handle an arbitrary start/end. I am
> > > not sure whether there are other dependencies which might need fixing or
> > > prohibit us from providing this method.
> > >
> > > Any feedback or comment is welcome.
> >
> > So who gets to use this? :)
> >
> > Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> > a written extent in a DAX file and call this instead of what it does now
> > (punch range and reallocate unwritten)?
>
> If it eliminates more block assumptions, then yes. In general I think
> there are opportunities to use "native" direct_access instead of
> block-i/o for other areas too, like metadata i/o.
>
> > Is this the kind of thing XFS should just do on its own when DAX tells us that
> > some range of pmem has gone bad and now we need to (a) race with the
> > userland programs to write /something/ to the range to prevent a machine
> > check (b) whack all the programs that think they have a mapping to
> > their data, (c) see if we have a DRAM copy and just write that back, (d)
> > set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?
>
> (a), (b) duplicate what memory error handling already does. So yes,
> could be done but it only helps if machine check handling is broken or
> missing.

<nod>

> (c) what DRAM copy in the DAX case?

Sorry, I was talking about the fs metadata that we cache in DRAM.

> (d) dax fsync is just cache flush, so it can't fail, or are you
> talking about errors in metadata?

I'm talking about an S_DAX file that someone is doing regular write()s
to:

1. Open file O_RDWR
2. Write something to the file
3. Some time later, something decides the pmem is bad.
4. Program calls fsync(); does it return EIO?

(I shouldn't have mixed the metadata/file data cases, sorry...)

> (e) I thought our solution for dax metadata redundancy is to use a
> realtime data device and raid mirror for the metadata device.

In the end it was set aside on the grounds that reserving space for a
separate metadata device was too costly and too complex for now. We
might get back to it later when the <cough> economics improve.

> > <cough> Will XFS ever get that "your storage went bad" hook that was
> > promised ages ago?
>
> pmem developers don't scale?

Ah, sorry. :/

> > Though I guess it only does this a single page at a time, which won't be
> > awesome if we're trying to zero (say) 100GB of pmem. I was expecting to
> > see one big memset() call to zero the entire range followed by
> > pmem_clear_poison() on the entire range, but I guess you did tag this
> > RFC. :)
>
> Until movdir64b is available the only way to clear poison is by making
> a call to the BIOS. The BIOS may not be efficient at bulk clearing.

Well then let's port XFS to SMM mode. <duck>

(No, please don't...)

--D
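
P.S. For reference, the four-step write()/fsync() scenario above might look
roughly like this from the userspace side. This is only an illustrative
sketch, not part of the patch: write_then_fsync() is a hypothetical helper
name, O_CREAT is added so the sketch also works on a fresh path, and whether
fsync() really reports EIO on an S_DAX file after the pmem goes bad is
exactly the open question here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not kernel or
 * patch code): open 'path' read-write, write 'len' bytes from 'buf',
 * then fsync(). Returns 0 on success or a negative errno value, so a
 * caller can tell -EIO apart from other failures.
 */
int write_then_fsync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    int ret = 0;

    /* Step 1: open the file O_RDWR (O_CREAT only for convenience). */
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -errno;

    /* Step 2: write something, handling short writes and EINTR. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            ret = -errno;
            break;
        }
        p += n;
        left -= (size_t)n;
    }

    /*
     * Step 3 (media going bad) happens outside the program.
     * Step 4: fsync() is where a writeback error would be reported,
     * if the filesystem chooses to report one.
     */
    if (ret == 0 && fsync(fd) < 0)
        ret = -errno;

    if (close(fd) < 0 && ret == 0)
        ret = -errno;

    return ret;
}
```

A caller would compare the return value against -EIO to see whether the
error was surfaced at fsync() time; on healthy storage it simply returns 0.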