Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> · Wed, 16 Mar 2022 21:46:07 +0800

在 2022/3/12 7:35, Dan Williams 写道:
On Sun, Feb 27, 2022 at 4:08 AM Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> wrote:

To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation.  This holder is used to
remember who is using this dax_device:
  - When it is the backend of a filesystem, the holder will be the
    instance of this filesystem.
  - When this pmem device is one of the targets in a mapped device, the
    holder will be this mapped device.  In this case, the mapped device
    has its own dax_device and it will follow the first rule.  So that we
    can finally track to the filesystem we needed.

The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.

Signed-off-by: Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx>
---
  drivers/dax/super.c | 89 +++++++++++++++++++++++++++++++++++++++++++++
  include/linux/dax.h | 32 ++++++++++++++++
  2 files changed, 121 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e3029389d809..da5798e19d57 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -21,6 +21,9 @@
   * @cdev: optional character interface for "device dax"
   * @private: dax driver private data
   * @flags: state and boolean properties
+ * @ops: operations for dax_device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
   */
  struct dax_device {
         struct inode inode;
@@ -28,6 +31,8 @@ struct dax_device {
         void *private;
         unsigned long flags;
         const struct dax_operations *ops;
+       void *holder_data;
+       const struct dax_holder_operations *holder_ops;
  };

  static dev_t dax_devt;
@@ -193,6 +198,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
  }
  EXPORT_SYMBOL_GPL(dax_zero_page_range);

+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+                             u64 len, int mf_flags)
+{
+       int rc, id;
+
+       id = dax_read_lock();
+       if (!dax_alive(dax_dev)) {
+               rc = -ENXIO;
+               goto out;
+       }
+
+       if (!dax_dev->holder_ops) {
+               rc = -EOPNOTSUPP;

I think it is ok to return success (0) for this case. All the caller
of dax_holder_notify_failure() wants to know is if the notification
was successfully delivered to the holder. If there is no holder
present then there is nothing to report. This is minor enough for me
to fix up locally if nothing else needs to be changed.

I thought it could fall back to generic memory failure handler: 
mf_generic_kill_procs(), if holder_ops not exists.


+               goto out;
+       }
+
+       rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+       dax_read_unlock(id);
+       return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
  #ifdef CONFIG_ARCH_HAS_PMEM_API
  void arch_wb_cache_pmem(void *addr, size_t size);
  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -268,6 +296,10 @@ void kill_dax(struct dax_device *dax_dev)

         clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
         synchronize_srcu(&dax_srcu);
+
+       /* clear holder data */
+       dax_dev->holder_ops = NULL;
+       dax_dev->holder_data = NULL;

Isn't this another failure scenario? If kill_dax() is called while a
holder is still holding the dax_device that seems to be another
->notify_failure scenario to tell the holder that the device is going
away and the holder has not released the device yet.

Yes.  I should call dax_holder_notify_failure() and then unregister the 
holder.


  }
  EXPORT_SYMBOL_GPL(kill_dax);

@@ -409,6 +441,63 @@ void put_dax(struct dax_device *dax_dev)
  }
  EXPORT_SYMBOL_GPL(put_dax);

+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+
+ * Return: the holder's data which represents the holder if registered,
+ * otherwize NULL.
+ */
+void *dax_holder(struct dax_device *dax_dev)
+{
+       if (!dax_alive(dax_dev))
+               return NULL;

It's safe for the holder to assume that it can de-reference
->holder_data freely in its notify_handler callback because
dax_holder_notify_failure() arranges for the callback to run in
dax_read_lock() context.

This is another minor detail that I can fixup locally.

+
+       return dax_dev->holder_data;
+}
+EXPORT_SYMBOL_GPL(dax_holder);
+
+/**
+ * dax_register_holder() - register a holder to a dax device
+ * @dax_dev: a dax_device instance
+ * @holder: a pointer to a holder's data which represents the holder
+ * @ops: operations of this holder
+
+ * Return: negative errno if an error occurs, otherwise 0.
+ */
+int dax_register_holder(struct dax_device *dax_dev, void *holder,
+               const struct dax_holder_operations *ops)
+{
+       if (!dax_alive(dax_dev))
+               return -ENXIO;
+
+       if (cmpxchg(&dax_dev->holder_data, NULL, holder))
+               return -EBUSY;
+
+       dax_dev->holder_ops = ops;
+       return 0;
+}
+EXPORT_SYMBOL_GPL(dax_register_holder);
+
+/**
+ * dax_unregister_holder() - unregister the holder for a dax device
+ * @dax_dev: a dax_device instance
+ * @holder: the holder to be unregistered
+ *
+ * Return: negative errno if an error occurs, otherwise 0.
+ */
+int dax_unregister_holder(struct dax_device *dax_dev, void *holder)
+{
+       if (!dax_alive(dax_dev))
+               return -ENXIO;
+
+       if (cmpxchg(&dax_dev->holder_data, holder, NULL) != holder)
+               return -EBUSY;
+       dax_dev->holder_ops = NULL;
+       return 0;
+}
+EXPORT_SYMBOL_GPL(dax_unregister_holder);
+
  /**
   * inode_dax: convert a public inode into its dax_dev
   * @inode: An inode with i_cdev pointing to a dax_dev
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..262d7bad131a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,8 +32,24 @@ struct dax_operations {
         int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
  };

+struct dax_holder_operations {
+       /*
+        * notify_failure - notify memory failure into inner holder device
+        * @dax_dev: the dax device which contains the holder
+        * @offset: offset on this dax device where memory failure occurs
+        * @len: length of this memory failure event

Forgive me if this has been discussed before, but since dax_operations
are in terms of pgoff and nr pages and memory_failure() is in terms of
pfns what was the rationale for making the function signature byte
based?

Maybe I didn't describe it clearly...  The @offset and @len here are 
byte-based.  And so is ->memory_failure().

You can find the implementation of ->memory_failure() in 3rd patch:

+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+		phys_addr_t addr, u64 len, int mf_flags)
+{
+	struct pmem_device *pmem =
+			container_of(pgmap, struct pmem_device, pgmap);
+	u64 offset = addr - pmem->phys_addr - pmem->data_offset;
+
+	return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}


I want to get this series merged into linux-next shortly after
v5.18-rc1. Then we can start working on incremental fixups rather
resending the full series with these long reply cycles.


Thanks.  That really helps.


--
Ruan.