On Thu, Nov 4, 2021 at 11:34 AM Jane Chu <jane.chu@xxxxxxxxxx> wrote:
>
> Thanks for the enlightening discussion here, it's so helpful!
>
> Please allow me to recap what I've caught up so far -
>
> 1. recovery write at page boundary due to NP setting in poisoned
>    page to prevent undesirable prefetching
> 2. single interface to perform 3 tasks:
>      { clear-poison, update error-list, write }
>    such as an API in pmem driver.
>    For CPUs that support MOVDIR64B, the 'clear-poison' and 'write'
>    task can be combined (would need something different from the
>    existing _copy_mcsafe though) and 'update error-list' follows
>    closely behind;
>    For CPUs that rely on firmware call to clear poison, the existing
>    pmem_clear_poison() can be used, followed by the 'write' task.
> 3. if user isn't given RWF_RECOVERY_FLAG flag, then dax recovery
>    would be automatic for a write if range is page aligned;
>    otherwise, the write fails with EIO as usual.
>    Also, user mustn't have punched out the poisoned page in which
>    case poison repairing will be a lot more complicated.
> 4. desirable to fetch as much data as possible from a poisoned range.
>
> If this understanding is in the right direction, then I'd like to
> propose below changes to
>   dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
>   and the dm layer copy_to/from_iter, dax_iomap_iter().
>
> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
>    is likely media error: if the API without DAX_F_RECOVERY returns
>    -EIO, then switch to recovery-read/write code.  In recovery code,
>    supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
>    'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.

I like it. It allows for an atomic write+clear implementation on
capable platforms and coordinates with potentially unmapped pages. The
best of both worlds from the dax_clear_poison() proposal and my "take
a fault and do a slow-path copy".

> 2. the _copy_to/from_iter implementation would be largely the same
>    as in my recent patch, but some changes in Christoph's
>    'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
>    virtual devices don't have the ability to clear poison, so no need
>    to complicate them.  And this also means that not every endpoint
>    dax device has to provide dax_op.copy_to/from_iter, they may use the
>    default.

Did I miss this series or are you talking about this one?
https://lore.kernel.org/all/20211018044054.1779424-1-hch@xxxxxx/

> I'm not sure about nova and others, if they use different 'write' other
> than via iomap, does that mean there will be need for a new set of
> dax_op for their read/write?

No, they're out-of-tree; they'll adjust to the same interface that xfs
and ext4 are using when/if they go upstream.

> the 3-in-1 binding would always be
> required though. Maybe that'll be an ongoing discussion?

Yeah, let's cross that bridge when we come to it.

> Comments? Suggestions?

It sounds great to me!
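
For reference, the fallback in proposed change 1 could be modelled in user space roughly as below. This is only a sketch, not kernel code: every mock_* name, MOCK_F_RECOVERY, and the in-memory "device" are hypothetical stand-ins for dax_direct_access(), dax_copy_from_iter() and DAX_F_RECOVERY, and the page-aligned restriction is taken from recap item 3.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MOCK_PAGE_SIZE  4096u
#define MOCK_NPAGES     4u
#define MOCK_F_RECOVERY 0x1u  /* hypothetical stand-in for DAX_F_RECOVERY */

/* Hypothetical in-memory "pmem" device; poisoned[] plays the error list. */
struct mock_pmem {
	unsigned char data[MOCK_NPAGES][MOCK_PAGE_SIZE];
	bool poisoned[MOCK_NPAGES];
};

/* Stand-in for dax_direct_access(): -EIO on a poisoned page unless the
 * caller passes the recovery flag, in which case kaddr is still returned. */
static int mock_direct_access(struct mock_pmem *dev, size_t pgoff,
			      unsigned int flags, unsigned char **kaddr)
{
	if (pgoff >= MOCK_NPAGES)
		return -ERANGE;
	if (dev->poisoned[pgoff] && !(flags & MOCK_F_RECOVERY))
		return -EIO;
	*kaddr = dev->data[pgoff];
	return 0;
}

/* Stand-in for the 3-in-1 copy: clear-poison, update error-list, write.
 * Recovery is only attempted for a whole, page-aligned write. */
static int mock_copy_from_buf(struct mock_pmem *dev, size_t pgoff, size_t off,
			      const void *src, size_t len, unsigned int flags,
			      unsigned char *kaddr)
{
	if (off > MOCK_PAGE_SIZE || len > MOCK_PAGE_SIZE - off)
		return -EINVAL;
	if (dev->poisoned[pgoff]) {
		if (!(flags & MOCK_F_RECOVERY))
			return -EIO;
		if (off != 0 || len != MOCK_PAGE_SIZE)
			return -EIO;                 /* not page aligned */
		memset(kaddr, 0, MOCK_PAGE_SIZE);    /* clear-poison */
		dev->poisoned[pgoff] = false;        /* update error-list */
	}
	memcpy(kaddr + off, src, len);               /* write */
	return 0;
}

/* Stand-in for the write path: try the fast path first, and only on -EIO
 * retry the access and the copy with the recovery flag set. */
static int mock_write_page(struct mock_pmem *dev, size_t pgoff, size_t off,
			   const void *src, size_t len)
{
	unsigned char *kaddr = NULL;
	unsigned int flags = 0;
	int rc = mock_direct_access(dev, pgoff, flags, &kaddr);

	if (rc == -EIO) {
		flags |= MOCK_F_RECOVERY;
		rc = mock_direct_access(dev, pgoff, flags, &kaddr);
	}
	if (rc)
		return rc;
	return mock_copy_from_buf(dev, pgoff, off, src, len, flags, kaddr);
}
```

In this model a sub-page write to a poisoned page still fails with -EIO and leaves the poison in place, while a full-page write to the same page clears it and succeeds.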