On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?

The only one who could know is Intel..

> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can.  Those are
> reads and writes, not {pre,f}allocation activities.  This is where Dave
> and I arrived at a month ago.
>
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow?  That would probably have the same user interface
> flag, just a different path into the pmem driver.

Which is fine with me.  If you look at the API here we do have the
RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
API, which then gets special casing over three methods.

And while Pavel pointed out that he and Jens are now optimizing for
single branches like this, I think this actually is silly and it is
not my point.

The point is that the DAX in-kernel API is a mess, and before we make
it even worse we need to sort it out first.  What is directly relevant
here is that the copy_from_iter and copy_to_iter APIs do not make
sense.  Most of the DAX API is based around getting a memory mapping
using ->direct_access, it is just the read/write path, which is a slow
path, that actually uses them.  I have a very WIP patch series to try
to sort this out here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize

But back to this series.  The basic DAX model is that the caller gets
a memory mapping and just works on that, maybe calling a sync after a
write in a few cases.  So any kind of recovery really needs to be able
to work with that model, as going forward the copy_to/from_iter path
will be used less and less, i.e. file systems can and should use
direct_access directly instead of using the block layer implementation
in the pmem driver.
As an example, the dm-writecache driver, the pending bcache nvdimm
support and the (horrible and out of tree) nova file system won't even
use this path.  We need to find a way to support recovery for them.
And overloading it over the read/write path, which is not the main
path for DAX but the absolute fast path for 99% of the kernel users,
is a horrible idea.

So how can we work around the horrible nvdimm design for data recovery
in a way that:

   a) actually works with the intended direct memory map use case
   b) doesn't really affect the normal kernel too much

?
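
To make the "get a mapping, work on it, sync after a write" model
concrete, here is a rough userspace analogue.  It is only a sketch
under assumptions: the helper names store_via_mapping() and
load_via_mapping() are made up for illustration, it uses a plain
MAP_SHARED mapping plus msync() on an ordinary file, and it says
nothing about how the in-kernel ->direct_access callers or a recovery
path would actually look.  A real DAX user would typically map with
MAP_SHARED_VALIDATE | MAP_SYNC on a DAX-capable file system.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a kernel or libc API): store len bytes at
 * offset 0 of the file through a shared mapping instead of write(2).
 * Returns 0 on success, -1 on error.
 */
static int store_via_mapping(const char *path, const void *buf, size_t len)
{
	int ret = -1;
	void *map;
	int fd;

	if (len == 0)
		return -1;	/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* The caller just works on the mapping ... */
	memcpy(map, buf, len);
	/* ... and calls a sync after the write. */
	if (msync(map, len, MS_SYNC) == 0)
		ret = 0;

	munmap(map, len);
out_close:
	close(fd);
	return ret;
}

/*
 * Hypothetical helper: load len bytes from offset 0 of the file through
 * a read-only shared mapping.  Returns 0 on success, -1 on error.
 */
static int load_via_mapping(const char *path, void *buf, size_t len)
{
	struct stat st;
	int ret = -1;
	void *map;
	int fd;

	if (len == 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* Touching pages beyond EOF would raise SIGBUS, so check first. */
	if (fstat(fd, &st) < 0 || (size_t)st.st_size < len)
		goto out_close;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(buf, map, len);
	munmap(map, len);
	ret = 0;
out_close:
	close(fd);
	return ret;
}
```

Note that neither helper ever goes through read(2)/write(2), so a
recovery mechanism that only hooks the read/write path would never be
reached by code written in this style.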