Shiyang Ruan wrote: > This patch is inspired by Dan's "mm, dax, pmem: Introduce > dev_pagemap_failure()"[1]. With the help of dax_holder and > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > on it to unmap all files in use, and notify processes who are using > those files. > > Call trace: > trigger unbind > -> unbind_store() > -> ... (skip) > -> devres_release_all() > -> kill_dax() > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) > -> xfs_dax_notify_failure() > `-> freeze_super() // freeze (kernel call) > `-> do xfs rmap > ` -> mf_dax_kill_procs() > ` -> collect_procs_fsdax() // all associated processes > ` -> unmap_and_kill() > ` -> invalidate_inode_pages2_range() // drop file's cache > `-> thaw_super() // thaw (both kernel & user call) > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent > new dax mapping from being created. Do not shutdown filesystem directly > if configuration is not supported, or if failure range includes metadata > area. Make sure all files and processes(not only the current progress) > are handled correctly. Also drop the cache of associated files before > pmem is removed. I would say more about why this is important for DAX users. Yes, the devm_memremap_pages() vs get_user_pages() infrastructure can be improved if it has a mechanism to revoke all pages that it has handed out for a given device, but that's not an end user visible effect. The end user impact needs to be clear. Is this for existing deployed pmem where a user accidentally removes a device and wants failures and process killing instead of hangs? The reason Linux has got along without this for so long is because pmem is difficult to remove (and with the sunset of Optane, difficult to acquire). One motivation to pursue this is CXL where hotplug is better defined and use cases like dynamic capacity devices where making forward progress to kill processes is better than hanging. It would help to have an example of what happens without this patch. > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/ > > Signed-off-by: Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> > --- > drivers/dax/super.c | 3 +- > fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++--- > include/linux/mm.h | 1 + > mm/memory-failure.c | 17 ++++++-- > 4 files changed, 96 insertions(+), 11 deletions(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index c4c4728a36e4..2e1a35e82fce 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) > return; > > if (dax_dev->holder_data != NULL) > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); > + dax_holder_notify_failure(dax_dev, 0, U64_MAX, > + MF_MEM_PRE_REMOVE); The motivation in the original proposal was to convey the death of large extents to memory_failure(). However, that proposal predated your mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a new bulk memory_failure() API is gone. This is where the end user impact needs to be clear. It seems that without this patch the filesystem may assume failure while the device is already present, but that seems ok. The goal is forward progress after a mistake not necessarily minimizing damage after a mistake. The fact that the current code is not as gentle could be considered a feature because graceful shutdown should always unmount before unplug, and if one unplugs before unmount it is already understood that they get to keep the pieces. Because the driver ->remove() callback can not enforce that the device is still present it seems unnecessary to optimize for the case where the filesystem is the device is being removed from an actively mounted filesystem, but the device is still present. The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say "userspace failed to umount before hardware eject, stop trying to access this range", rather than "try to finish up in this range, but it might already be too late".