Re: [PATCH v8 7/7] pmem: implement pmem_recovery_write()

Jane Chu <jane.chu@xxxxxxxxxx> · Wed, 20 Apr 2022 17:01:15 +0000

On 4/19/2022 11:26 PM, Dan Williams wrote:
> On Tue, Apr 19, 2022 at 7:06 PM Jane Chu <jane.chu@xxxxxxxxxx> wrote:
>>
>> The recovery write thread started out as a normal pwrite thread and
>> when the filesystem was told about potential media error in the
>> range, filesystem turns the normal pwrite to a dax_recovery_write.
>>
>> The recovery write consists of clearing media poison, clearing page
>> HWPoison bit, reenable page-wide read-write permission, flush the
>> caches and finally write.  A competing pread thread will be held
>> off during the recovery process since data read back might not be
>> valid, and this is achieved by clearing the badblock records after
>> the recovery write is complete. Competing recovery write threads
>> are serialized by pmem device level .recovery_lock.
>>
>> Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx>
>> ---
>>   drivers/nvdimm/pmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++-
>>   drivers/nvdimm/pmem.h |  1 +
>>   2 files changed, 56 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index c3772304d417..134f8909eb65 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -332,10 +332,63 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>>          return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
>>   }
>>
>> +/*
>> + * The recovery write thread started out as a normal pwrite thread and
>> + * when the filesystem was told about potential media error in the
>> + * range, filesystem turns the normal pwrite to a dax_recovery_write.
>> + *
>> + * The recovery write consists of clearing media poison, clearing page
>> + * HWPoison bit, reenable page-wide read-write permission, flush the
>> + * caches and finally write.  A competing pread thread will be held
>> + * off during the recovery process since data read back might not be
>> + * valid, and this is achieved by clearing the badblock records after
>> + * the recovery write is complete. Competing recovery write threads
>> + * are serialized by pmem device level .recovery_lock.
>> + */
>>   static size_t pmem_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
>>                  void *addr, size_t bytes, struct iov_iter *i)
>>   {
>> -       return 0;
>> +       struct pmem_device *pmem = dax_get_private(dax_dev);
>> +       size_t olen, len, off;
>> +       phys_addr_t pmem_off;
>> +       struct device *dev = pmem->bb.dev;
>> +       long cleared;
>> +
>> +       off = offset_in_page(addr);
>> +       len = PFN_PHYS(PFN_UP(off + bytes));
>> +       if (!is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) >> SECTOR_SHIFT, len))
>> +               return _copy_from_iter_flushcache(addr, bytes, i);
>> +
>> +       /*
>> +        * Not page-aligned range cannot be recovered. This should not
>> +        * happen unless something else went wrong.
>> +        */
>> +       if (off || !PAGE_ALIGNED(bytes)) {
>> +               dev_warn(dev, "Found poison, but addr(%p) or bytes(%#lx) not page aligned\n",
>> +                       addr, bytes);
> 
> If this warn stays:
> 
> s/dev_warn/dev_warn_ratelimited/
> 
> The kernel prints hashed addresses for %p, so I'm not sure printing
> @addr is useful or @bytes because there is nothing actionable that can
> be done with that information in the log. @pgoff is probably the only
> variable worth printing (after converting to bytes or sectors) as that
> might be able to be reverse mapped back to the impacted data.

The intention with printing @addr and @bytes is to show the 
mis-alignment. In the past when UC- was set on poisoned dax page, 
returning less than a page being written would cause dax_iomap_iter to 
produce next iteration with @addr and @bytes not-page-aligned.  Although 
UC- doesn't apply here, I thought it might still be worth while to watch 
for similar scenario.  Also that's why @pgoff isn't helpful.

How about s/dev_warn/dev_dbg/ ?

> 
>> +               return 0;
>> +       }
>> +
>> +       mutex_lock(&pmem->recovery_lock);
>> +       pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>> +       cleared = __pmem_clear_poison(pmem, pmem_off, len);
>> +       if (cleared > 0 && cleared < len) {
>> +               dev_warn(dev, "poison cleared only %ld out of %lu bytes\n",
>> +                       cleared, len);
> 
> This looks like dev_dbg() to me, or at minimum the same
> dev_warn_ratelimited() print as the one above to just indicate a write
> to the device at the given offset failed.

Will s/dev_warn/dev_dbg/

> 
>> +               mutex_unlock(&pmem->recovery_lock);
>> +               return 0;
>> +       }
>> +       if (cleared < 0) {
>> +               dev_warn(dev, "poison clear failed: %ld\n", cleared);
> 
> Same feedback here, these should probably all map to the identical
> error exit ratelimited print.

Will s/dev_warn/dev_dbg/

> 
>> +               mutex_unlock(&pmem->recovery_lock);
> 
> It occurs to me that all callers of this are arriving through the
> fsdax iomap ops and all of these callers take an exclusive lock to
> prevent simultaneous access to the inode. If recovery_write() is only
> used through that path then this lock is redundant. Simultaneous reads
> are protected by the fact that the badblocks are cleared last. I think
> this can wait to add the lock when / if a non-fsdax access path
> arrives for dax_recovery_write(), and even then it should probably
> enforce the single writer exclusion guarantee of the fsdax path.
> 

Indeed, the caller dax_iomap_rw has already held the writer lock.

Will remove .recovery_lock for now.

BTW, how are the other patches look to you?

Thanks!
-jane

>> +               return 0;
>> +       }
>> +
>> +       olen = _copy_from_iter_flushcache(addr, bytes, i);
>> +       pmem_clear_bb(pmem, to_sect(pmem, pmem_off), cleared >> SECTOR_SHIFT);
>> +
>> +       mutex_unlock(&pmem->recovery_lock);
>> +       return olen;
>>   }
>>
>>   static const struct dax_operations pmem_dax_ops = {
>> @@ -525,6 +578,7 @@ static int pmem_attach_disk(struct device *dev,
>>          if (rc)
>>                  goto out_cleanup_dax;
>>          dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
>> +       mutex_init(&pmem->recovery_lock);
>>          pmem->dax_dev = dax_dev;
>>
>>          rc = device_add_disk(dev, disk, pmem_attribute_groups);
>> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
>> index 392b0b38acb9..91e40f5e3c0e 100644
>> --- a/drivers/nvdimm/pmem.h
>> +++ b/drivers/nvdimm/pmem.h
>> @@ -27,6 +27,7 @@ struct pmem_device {
>>          struct dax_device       *dax_dev;
>>          struct gendisk          *disk;
>>          struct dev_pagemap      pgmap;
>> +       struct mutex            recovery_lock;
>>   };
>>
>>   long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> --
>> 2.18.4
>>