RE: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

Dan Williams <dan.j.williams@xxxxxxxxx> · Mon, 7 Aug 2023 17:31:18 -0700

Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.

I would say more about why this is important for DAX users. Yes, the
devm_memremap_pages() vs get_user_pages() infrastructure can be improved
if it has a mechanism to revoke all pages that it has handed out for a
given device, but that's not an end user visible effect.

The end user impact needs to be clear. Is this for existing deployed
pmem where a user accidentally removes a device and wants failures and
process killing instead of hangs?

The reason Linux has got along without this for so long is because pmem
is difficult to remove (and with the sunset of Optane, difficult to
acquire). One motivation to pursue this is CXL where hotplug is better
defined and use cases like dynamic capacity devices where making forward
progress to kill processes is better than hanging.

It would help to have an example of what happens without this patch.

> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx>
> ---
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 ++++++--
>  4 files changed, 96 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);

The motivation in the original proposal was to convey the death of
large extents to memory_failure(). However, that proposal predated your
mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a
new bulk memory_failure() API is gone.

This is where the end user impact needs to be clear. It seems that
without this patch the filesystem may assume failure while the device is
already present, but that seems ok. The goal is forward progress after a
mistake not necessarily minimizing damage after a mistake. The fact that
the current code is not as gentle could be considered a feature because
graceful shutdown should always unmount before unplug, and if one
unplugs before unmount it is already understood that they get to keep
the pieces.

Because the driver ->remove() callback can not enforce that the device
is still present it seems unnecessary to optimize for the case where the
filesystem is the device is being removed from an actively mounted
filesystem, but the device is still present.

The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say
"userspace failed to umount before hardware eject, stop trying to access
this range", rather than "try to finish up in this range, but it might
already be too late".