On Fri, Sep 17, 2021 at 6:30 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> Hi all,
>
> Jane Chu has taken an interest in trying to fix the pmem poison recovery
> story on Linux. Since I sort of had a half-baked patchset that seems to
> contain some elements of what the reviewers of her patchset wanted, I'm
> releasing this reworked version to see if it has any traction.
>
> Our current "advice" to people using persistent memory and FSDAX who
> wish to recover upon receipt of a media error (aka 'hwpoison') event
> from ACPI is to punch-hole that part of the file and then pwrite it,
> which will magically cause the pmem to be reinitialized and the poison
> to be cleared.
>
> Punching doesn't make any sense at all -- the (re)allocation on pwrite
> does not permit the caller to specify where to find blocks, which means
> that we might not get the same pmem back.

Not sure this is a driving concern. If you get the same pmem back it
will have gone through a poison clearing cycle when it gets
reallocated. Also, once the filesystem gets notified of error
locations through Ruan's series the FS can avoid allocating blocks
where poison clearing failed.

> This pushes the user farther
> away from the goal of reinitializing poisoned memory and leads to
> complaints about unnecessary file fragmentation.

Fragmentation though is a valid concern.

>
> AFAICT, the only reason why the "punch and write" dance works at all is
> that the XFS and ext4 currently call blkdev_issue_zeroout when
> allocating pmem ahead of a write call. Even a regular overwrite won't
> clear the poison, because dax_direct_access is smart enough to bail out
> on poisoned pmem, but not smart enough to clear it.

Alignment constraints were the entanglement that kept DAX from poison
clearing. This is similar to the dance you need to do to get a disk to
remap a bad block, which needs an O_DIRECT write. It was also deemed
messy to keep overloading writes this way.

> To be fair, that
> function maps pages and has no idea what kinds of reads and writes the
> caller might want to perform.
>
> Therefore, clean up this whole mess by creating a dax_zeroinit_range
> function that callers can use on poisoned persistent memory to reset the
> contents of the persistent memory to a known state (all zeroes) and
> clear any lingering poison state that might be lingering in the memory
> controllers. Create a new fallocate mode to trigger this functionality,
> then wire up XFS and ext4 to use it. For good measure, wire it up to
> traditional storage if the storage has a fast way to zero LBA contents,
> since we assume that those LBAs won't hit old media errors.

Sounds good, I'll take a look at the rest.
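
For reference, the "punch and write" dance discussed above looks roughly like the following from userspace. This is only a minimal sketch: the helper name punch_and_rewrite is hypothetical and does not come from the patchset, and it uses only the existing fallocate(2) and pwrite(2) calls, not the new fallocate mode proposed in the series. Whether the rewrite actually clears poison depends on the filesystem zeroing the blocks it allocates, and, as noted in the thread, the blocks that come back may not be the ones that were punched out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patchset): punch out the byte
 * range [off, off + len) of an open file, then write zeroes back over it.
 * Returns 0 on success, -1 with errno set on failure.
 */
int punch_and_rewrite(int fd, off_t off, size_t len)
{
	char *zeroes;
	size_t done = 0;

	assert(len > 0);

	/* Step 1: free the blocks backing the range; the file size is kept. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, (off_t)len) < 0)
		return -1;

	zeroes = calloc(1, len);
	if (!zeroes)
		return -1;

	/*
	 * Step 2: write the range back. The filesystem has to allocate
	 * blocks again here; the caller cannot choose which ones.
	 */
	while (done < len) {
		ssize_t n = pwrite(fd, zeroes + done, len - done,
				   off + (off_t)done);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0) {
			int saved = (n == 0) ? EIO : errno;

			free(zeroes);
			errno = saved;
			return -1;
		}
		done += (size_t)n;
	}

	free(zeroes);
	return fsync(fd);
}
```

A caller would typically pass the offset and length reported with the media error, rounded out to whatever block granularity the filesystem uses. Data outside the range is left untouched; everything inside it reads back as zeroes afterwards.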