On Wed, Nov 3, 2021 at 11:10 AM Jane Chu <jane.chu@xxxxxxxxxx> wrote:
>
> On 11/1/2021 11:18 PM, Christoph Hellwig wrote:
> > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> >> ...so would you happen to know if anyone's working on solving this
> >> problem for us by putting the memory controller in charge of dealing
> >> with media errors?
> >
> > The only one who could know is Intel..
> >
> >> The trouble is, we really /do/ want to be able to (re)write the failed
> >> area, and we probably want to try to read whatever we can.  Those are
> >> reads and writes, not {pre,f}allocation activities.  This is where Dave
> >> and I arrived at a month ago.
> >>
> >> Unless you'd be ok with a second IO path for recovery where we're
> >> allowed to be slow?  That would probably have the same user interface
> >> flag, just a different path into the pmem driver.
> >
> > Which is fine with me.  If you look at the API here we do have the
> > RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
> > API which then gets special casing over three methods.
> >
> > And while Pavel pointed out that he and Jens are now optimizing for
> > single branches like this, I think this actually is silly and it is
> > not my point.
> >
> > The point is that the DAX in-kernel API is a mess, and before we make
> > it even worse we need to sort it first.  What is directly relevant
> > here is that the copy_from_iter and copy_to_iter APIs do not make
> > sense.  Most of the DAX API is based around getting a memory mapping
> > using ->direct_access, it is just the read/write path which is a slow
> > path that actually uses this.  I have a very WIP patch series to try
> > to sort this out here:
> >
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> >
> > But back to this series.  The basic DAX model is that the caller gets a
> > memory mapping and just works on that, maybe calling a sync after a write
> > in a few cases.  So any kind of recovery really needs to be able to
> > work with that model as going forward the copy_to/from_iter path will
> > be used less and less.  i.e. file systems can and should use
> > direct_access directly instead of using the block layer implementation
> > in the pmem driver.  As an example the dm-writecache driver, the pending
> > bcache nvdimm support and the (horrible and out of tree) nova file systems
> > won't even use this path.  We need to find a way to support recovery
> > for them.  And overloading it over the read/write path which is not
> > the main path for DAX, but the absolutely fast path for 99% of the
> > kernel users is a horrible idea.
> >
> > So how can we work around the horrible nvdimm design for data recovery
> > in a way that:
> >
> > a) actually works with the intended direct memory map use case
> > b) doesn't really affect the normal kernel too much
> >
> > ?
> >
>
> This is clearer, I've looked at your 'dax-devirtualize' patch which
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed. So how about I go ahead
> and rebase my earlier patch
>
> https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@xxxxxxxxxx/
> on 'dax-devirtualize', and provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
> and we resolve the urgent pmem poison-clearing issue?
>
> Dan, are you okay with this?  I am getting pressure from our customers
> who are basically stuck at the moment.

The concern I have with dax_clear_poison() is that it precludes atomic
error clearing. Also, as Boris and I discussed, poisoned pages should
be marked NP (not present) rather than UC (uncacheable) [1]. With
those 2 properties combined I think that wants a custom pmem fault
handler that knows how to carefully write to pmem pages with poison
present, rather than an additional explicit dax-operation. That also
meets Christoph's requirement of "works with the intended direct
memory map use case".

[1]: https://lore.kernel.org/r/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@xxxxxxxxxxxxxx
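
To illustrate what "atomic error clearing" behind an NP mapping could
look like, here is a toy user-space model in C. It is not kernel code
and not a proposed implementation. Every name in it (toy_pmem_page,
toy_inject_poison, toy_fault_write) is hypothetical, the 256-byte
poison unit is an assumed granularity, and the rule that a partial
overwrite of a poisoned unit fails is a simplification chosen for the
sketch. The only point it tries to show is that new data and poison
clearing land together in one step, instead of a separate
clear-then-write sequence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096u
#define TOY_POISON_UNIT 256u  /* assumed clearing granularity (hypothetical) */
#define TOY_UNITS       (TOY_PAGE_SIZE / TOY_POISON_UNIT)

/* Toy stand-in for one pmem page; not a kernel structure. */
struct toy_pmem_page {
	unsigned char data[TOY_PAGE_SIZE];
	bool poisoned[TOY_UNITS]; /* per-unit media error flag */
	bool present;             /* false models an NP (not present) mapping */
};

/* Record a media error in one unit and model unmapping the page (NP). */
void toy_inject_poison(struct toy_pmem_page *pg, size_t unit)
{
	assert(unit < TOY_UNITS);
	pg->poisoned[unit] = true;
	pg->present = false;
}

/* Return true if any unit of the page is still poisoned. */
bool toy_page_has_poison(const struct toy_pmem_page *pg)
{
	for (size_t i = 0; i < TOY_UNITS; i++)
		if (pg->poisoned[i])
			return true;
	return false;
}

/*
 * Model of the write side of a fault handler for a page with poison.
 *
 * Returns 0 on success, -EINVAL if the range is empty or leaves the
 * page, and -EIO if the store would only partially cover a poisoned
 * unit (in this model that would require reading poisoned data).
 * On failure the page is left unmodified.
 */
int toy_fault_write(struct toy_pmem_page *pg, size_t off,
		    const void *src, size_t len)
{
	if (len == 0 || off > TOY_PAGE_SIZE || len > TOY_PAGE_SIZE - off)
		return -EINVAL;

	size_t first = off / TOY_POISON_UNIT;
	size_t last = (off + len - 1) / TOY_POISON_UNIT;

	/* Pass 1: reject partial overwrites of poisoned units. */
	for (size_t u = first; u <= last; u++) {
		size_t start = (size_t)u * TOY_POISON_UNIT;
		bool covered = off <= start &&
			       off + len >= start + TOY_POISON_UNIT;

		if (pg->poisoned[u] && !covered)
			return -EIO;
	}

	/*
	 * Pass 2: store the new data and clear poison in the same step.
	 * Any touched unit that was poisoned is fully covered here, so
	 * clearing the flag for every touched unit is safe.
	 */
	memcpy(pg->data + off, src, len);
	for (size_t u = first; u <= last; u++)
		pg->poisoned[u] = false;

	/* Model re-establishing the mapping once no poison remains. */
	if (!toy_page_has_poison(pg))
		pg->present = true;
	return 0;
}
```

In this model a caller that hits the NP page would retry with a store
covering the whole poisoned unit; a smaller store is refused and
leaves both the data and the poison state untouched.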