On 11/12/2021 7:36 AM, Mike Snitzer wrote: > On Wed, Nov 10 2021 at 1:26P -0500, > Jane Chu <jane.chu@xxxxxxxxxx> wrote: > >> On 11/9/2021 1:02 PM, Dan Williams wrote: >>> On Tue, Nov 9, 2021 at 11:59 AM Jane Chu <jane.chu@xxxxxxxxxx> wrote: >>>> >>>> On 11/9/2021 10:48 AM, Dan Williams wrote: >>>>> On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote: >>>>>> >>>>>> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote: >>>>>>> static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, >>>>>>> void *addr, size_t bytes, struct iov_iter *i, int mode) >>>>>>> { >>>>>>> + phys_addr_t pmem_off; >>>>>>> + size_t len, lead_off; >>>>>>> + struct pmem_device *pmem = dax_get_private(dax_dev); >>>>>>> + struct device *dev = pmem->bb.dev; >>>>>>> + >>>>>>> + if (unlikely(mode == DAX_OP_RECOVERY)) { >>>>>>> + lead_off = (unsigned long)addr & ~PAGE_MASK; >>>>>>> + len = PFN_PHYS(PFN_UP(lead_off + bytes)); >>>>>>> + if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) { >>>>>>> + if (lead_off || !(PAGE_ALIGNED(bytes))) { >>>>>>> + dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n", >>>>>>> + addr, bytes); >>>>>>> + return (size_t) -EIO; >>>>>>> + } >>>>>>> + pmem_off = PFN_PHYS(pgoff) + pmem->data_offset; >>>>>>> + if (pmem_clear_poison(pmem, pmem_off, bytes) != >>>>>>> + BLK_STS_OK) >>>>>>> + return (size_t) -EIO; >>>>>>> + } >>>>>>> + } >>>>>> >>>>>> This is in the wrong spot. As seen in my WIP series individual drivers >>>>>> really should not hook into copying to and from the iter, because it >>>>>> really is just one way to write to a nvdimm. How would dm-writecache >>>>>> clear the errors with this scheme? >>>>>> >>>>>> So IMHO going back to the separate recovery method as in your previous >>>>>> patch really is the way to go. If/when the 64-bit store happens we >>>>>> need to figure out a good way to clear the bad block list for that. >>>>> >>>>> I think we just make error management a first class citizen of a >>>>> dax-device and stop abstracting it behind a driver callback. That way >>>>> the driver that registers the dax-device can optionally register error >>>>> management as well. Then fsdax path can do: >>>>> >>>>> rc = dax_direct_access(..., &kaddr, ...); >>>>> if (unlikely(rc)) { >>>>> kaddr = dax_mk_recovery(kaddr); >>>> >>>> Sorry, what does dax_mk_recovery(kaddr) do? >>> >>> I was thinking this just does the hackery to set a flag bit in the >>> pointer, something like: >>> >>> return (void *) ((unsigned long) kaddr | DAX_RECOVERY) >> >> Okay, how about call it dax_prep_recovery()? >> >>> >>>> >>>>> dax_direct_access(..., &kaddr, ...); >>>>> return dax_recovery_{read,write}(..., kaddr, ...); >>>>> } >>>>> return copy_{mc_to_iter,from_iter_flushcache}(...); >>>>> >>>>> Where, the recovery version of dax_direct_access() has the opportunity >>>>> to change the page permissions / use an alias mapping for the access, >>>> >>>> again, sorry, what 'page permissions'? memory_failure_dev_pagemap() >>>> changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?), >>>> do you mean to reverse the change? >>> >>> Right, the result of the conversation with Boris is that >>> memory_failure() should mark the page as NP in call cases, so >>> dax_direct_access() needs to create a UC mapping and >>> dax_recover_{read,write}() would sink that operation and either return >>> the page to NP after the access completes, or convert it to WB if the >>> operation cleared the error. >> >> Okay, will add a patch to fix set_mce_nospec(). >> >> How about moving set_memory_uc() and set_memory_np() down to >> dax_recovery_read(), so that we don't split the set_memory_X calls >> over different APIs, because we can't enforce what follows >> dax_direct_access()? >> >>> >>>>> dax_recovery_read() allows reading the good cachelines out of a >>>>> poisoned page, and dax_recovery_write() coordinates error list >>>>> management and returning a poison page to full write-back caching >>>>> operation when no more poisoned cacheline are detected in the page. >>>>> >>>> >>>> How about to introduce 3 dax_recover_ APIs: >>>> dax_recover_direct_access(): similar to dax_direct_access except >>>> it ignores error list and return the kaddr, and hence is also >>>> optional, exported by device driver that has the ability to >>>> detect error; >>>> dax_recovery_read(): optional, supported by pmem driver only, >>>> reads as much data as possible up to the poisoned page; >>> >>> It wouldn't be a property of the pmem driver, I expect it would be a >>> flag on the dax device whether to attempt recovery or not. I.e. get >>> away from this being a pmem callback and make this a native capability >>> of a dax device. >>> >>>> dax_recovery_write(): optional, supported by pmem driver only, >>>> first clear-poison, then write. >>>> >>>> Should we worry about the dm targets? >>> >>> The dm targets after Christoph's conversion should be able to do all >>> the translation at direct access time and then dax_recovery_X can be >>> done on the resulting already translated kaddr. >> >> I'm thinking about the mixed device dm where some provides >> dax_recovery_X, others don't, in which case we don't allow >> dax recovery because that causes confusion? or should we still >> allow recovery for part of the mixed devices? > > I really don't like the all or nothing approach if it can be avoided. > I would imagine that if recovery possible it best to support it even > if the DM device happens to span a mix of devices with varying support > for recovery. Got it! thanks! -jane > > Thanks, > Mike >