Re: [RFC] dax,pmem: Provide a dax operation to zero range of memory

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Tue, 4 Feb 2020 15:23:18 -0800

On Fri, Jan 31, 2020 at 03:31:58PM -0800, Dan Williams wrote:
> On Thu, Jan 23, 2020 at 11:07 AM Darrick J. Wong
> <darrick.wong@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is an RFC patch to provide a dax operation to zero a range of memory.
> > > It will also clear poison in the process. This is primarily a compile-tested
> > > patch. I don't have real hardware to test the poison logic. I am posting
> > > this to figure out if this is the right direction or not.
> > >
> > > Motivation for this patch comes from Christoph's feedback that he would
> > > prefer a dax way to zero a range instead of relying on having to
> > > call blkdev_issue_zeroout() in __dax_zero_page_range().
> > >
> > > https://lkml.org/lkml/2019/8/26/361
> > >
> > > My motivation for this change is virtiofs DAX support. There we use DAX
> > > but we don't have a block device. So any dax code which has the assumption
> > > that there is always a block device associated is a problem. So this
> > > is more of a cleanup of one of the places where dax has this dependency
> > > on block device and if we add a dax operation for zeroing a range, it
> > > can help with not having to call blkdev_issue_zeroout() in dax path.
> > >
> > > I have yet to take care of stacked block drivers (dm/md).
> > >
> > > The current poison clearing logic is primarily written with the assumption
> > > that I/O is sector aligned. With this new method, this assumption is broken
> > > and one can pass any range of memory to zero. I have fixed a few places
> > > in the existing logic to be able to handle an arbitrary start/end. I am
> > > not sure whether there are other dependencies which might need fixing or
> > > prohibit us from providing this method.
> > >
> > > Any feedback or comment is welcome.
> >
> > So who gets to use this? :)
> >
> > Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> > a written extent in a DAX file and call this instead of what it does now
> > (punch range and reallocate unwritten)?
> 
> If it eliminates more block assumptions, then yes. In general I think
> there are opportunities to use "native" direct_access instead of
> block-i/o for other areas too, like metadata i/o.
> 
> > Is this the kind of thing XFS should just do on its own when DAX tells us that
> > some range of pmem has gone bad and now we need to (a) race with the
> > userland programs to write /something/ to the range to prevent a machine
> > check (b) whack all the programs that think they have a mapping to
> > their data, (c) see if we have a DRAM copy and just write that back, (d)
> > set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?
> 
> (a), (b) duplicate what memory error handling already does. So yes,
> could be done but it only helps if machine check handling is broken or
> missing.

<nod>

> (c) what DRAM copy in the DAX case?

Sorry, I was talking about the fs metadata that we cache in DRAM.

> (d) dax fsync is just cache flush, so it can't fail, or are you
> talking about errors in metadata?

I'm talking about an S_DAX file that someone is doing regular write()s
to:

1. Open file O_RDWR
2. Write something to the file
3. Some time later, something decides the pmem is bad.
4. Program calls fsync(); does it return EIO?
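
For concreteness, here is a minimal userspace sketch of that sequence. The
helper name write_then_fsync() is made up for illustration and is not part
of the patch or of any kernel interface. It says nothing about what the
kernel does today; it only shows the one place where a program doing plain
write()s would get to see an EIO if we decided to report one.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path, then fsync().
 * Returns 0 on success or a negative errno value (e.g. -EIO) on failure.
 */
int write_then_fsync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, ret = 0;

	assert(path != NULL && buf != NULL);

	/* Step 1: open the file O_RDWR (created here if it does not exist). */
	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -errno;

	/* Step 2: write something, retrying on short writes and EINTR. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			ret = -errno;
			goto out;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Step 3 (media goes bad) happens outside the program's control. */

	/* Step 4: the only point where the program can learn of the error. */
	if (fsync(fd) < 0)
		ret = -errno;
out:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would treat any nonzero return as "data may not be durable", e.g.
write_then_fsync("/mnt/pmem/file", "hello", 5), where the path is just an
example.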

(I shouldn't have mixed the metadata/file data cases, sorry...)

> (e) I thought our solution for dax metadata redundancy is to use a
> realtime data device and raid mirror for the metadata device.

In the end it was set aside on the grounds that reserving space for
a separate metadata device was too costly and too complex for now.
We might get back to it later when the <cough> economics improve.

> > <cough> Will XFS ever get that "your storage went bad" hook that was
> > promised ages ago?
> 
> pmem developers don't scale?

Ah, sorry. :/

> > Though I guess it only does this a single page at a time, which won't be
> > awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
> > see one big memset() call to zero the entire range followed by
> > pmem_clear_poison() on the entire range, but I guess you did tag this
> > RFC. :)
> 
> Until movdir64b is available the only way to clear poison is by making
> a call to the BIOS. The BIOS may not be efficient at bulk clearing.

Well then let's port XFS to SMM mode. <duck>

(No, please don't...)

--D