On Tue, Aug 17, 2021 at 08:39:00AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 16, 2021 at 02:05:18PM -0700, Darrick J. Wong wrote:
> > AFAICT, the only reason why the "punch and write" dance works at all is
> > that the XFS and ext4 currently call blkdev_issue_zeroout when
> > allocating pmem as part of a pwrite call. A pwrite without the punch
> > won't clear the poison, because pwrite on a DAX file calls
> > dax_direct_access to access the memory directly, and dax_direct_access
> > is only smart enough to bail out on poisoned pmem. It does not know how
> > to clear it. Userspace could solve the problem by calling FIEMAP and
> > issuing a BLKZEROOUT, but that requires rawio capabilities.
> >
> > The whole pmem poison recovery story is wrong and needs to be
> > corrected ASAP before everyone else starts doing this. Therefore,
> > create a dax_zeroinit_range function that filesystems can call to reset
> > the contents of the pmem to a known value and clear any state associated
> > with the media error. Then, connect FALLOC_FL_ZERO_RANGE to this new
> > function (for DAX files) so that unprivileged userspace has a safe way
> > to reset the pmem and clear media errors.
>
> I agree with the problem statement, but I don't think the fix is
> significantly better than what we have, as it still magically overloads
> other behavior. I'd rather have an explicit operation to clear the
> poison both at the syscall level (maybe another falloc mode), and the
> internal kernel API level (new method in dax_operations).

I've long wondered why we can't just pass a write flag to the
direct_access functions so that pmem_dax_direct_access can clear the
poison. Then we ought to be able to tell userspace that they can
recover from write errors by pwrite() or triggering a write fault on the
page, I think. That's how userspace recovers from IO errors on
traditional disks; I've never understood why it has to be any different
now.

> Also for the next iteration please split the iomap changes from the
> usage in xfs.

ok.

--D
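
For readers following the thread, here is a rough userspace sketch of the two recovery sequences under discussion: the existing "punch and write" dance, and a zero-range-then-write variant along the lines of the quoted proposal. The helper names (write_all, punch_and_write, zero_and_write) are made up for illustration. Whether either sequence actually clears poison depends on the kernel and filesystem; in particular, FALLOC_FL_ZERO_RANGE clearing media errors on DAX files is what the quoted patch proposes, not something this sketch can assume.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: pwrite() the whole buffer, retrying on EINTR
 * and on short writes.  Returns 0 on success, -1 with errno set. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			/* No progress; give up rather than spin. */
			errno = EIO;
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}

/* The "punch and write" dance: deallocate the range, then rewrite it
 * so the filesystem has to allocate (and, per the thread, zero) fresh
 * blocks as part of the pwrite. */
int punch_and_write(int fd, const void *buf, size_t len, off_t off)
{
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, len) < 0)
		return -1;
	return write_all(fd, buf, len, off);
}

/* Variant following the quoted proposal: ask the filesystem to zero
 * the range in place, then rewrite it.  ASSUMPTION: poison clearing
 * here only happens if the kernel wires ZERO_RANGE up as proposed. */
int zero_and_write(int fd, const void *buf, size_t len, off_t off)
{
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) < 0)
		return -1;
	return write_all(fd, buf, len, off);
}
```

Both helpers return -1 with errno set (for example EOPNOTSUPP) when the filesystem does not support the requested fallocate mode, so a caller can try one sequence and fall back to the other. A plain pwrite() with no preceding fallocate is the recovery path the reply above would like to see work on its own.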