Re: [PATCH v13 5/7] mm: Introduce mf_dax_kill_procs() for fsdax case

HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@xxxxxxx> · Thu, 21 Apr 2022 12:50:13 +0000

On Tue, Apr 19, 2022 at 12:50:43PM +0800, Shiyang Ruan wrote:
> This new function is a variant of mf_generic_kill_procs that accepts a
> file, offset pair instead of a struct to support multiple files sharing
> a DAX mapping.  It is intended to be called by the file systems as part
> of the memory_failure handler after the file system performed a reverse
> mapping from the storage address to the file and file offset.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx>
> Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> ---
...

> @@ -539,13 +548,41 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  			 * to be informed of all such data corruptions.
>  			 */
>  			if (vma->vm_mm == t->mm)
> -				add_to_kill(t, page, vma, to_kill);
> +				add_to_kill(t, page, 0, vma, to_kill);
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
>  	i_mmap_unlock_read(mapping);
>  }
>  
> +#if IS_ENABLED(CONFIG_FS_DAX)

This macro is equivalent with "#ifdef CONFIG_FS_DAX", and you also add "#ifdef" below,
so aligning to either (I prefer "#ifdef") might be better.

> +/*
> + * Collect processes when the error hit a fsdax page.
> + */
> +static void collect_procs_fsdax(struct page *page,
> +		struct address_space *mapping, pgoff_t pgoff,
> +		struct list_head *to_kill)

Unlike collect_procs_file(), this new function does not have parameter
force_early, and passes true unconditionally to task_early_kill().

Looking at the current code, I noticed the following code and comment:

	/*
	 * Unlike System-RAM there is no possibility to swap in a
	 * different physical page at a given virtual address, so all
	 * userspace consumption of ZONE_DEVICE memory necessitates
	 * SIGBUS (i.e. MF_MUST_KILL)
	 */
	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;

, which forcibly sets MF_ACTION_REQUIRED and I guess that this is the reason
for passing true above.  But now I'm not sure that setting these flags for
all error events on NVDIMM is really right thing.  The above comment sounds to
me that memory_failure_dev_pagemap() is called to handle the event when the data
on ZONE_DEVICE memory is about to be accessed/consumed, but is that the only case?

I thought that memory_failure() can be called by memory scrubbing *before*
some process actually access to it, and MCE handler judges whether action is
required or not based on MSRs.  Does this not happen on NVDIMM error?

Anyway this question might be a little off-topic, so not to be a blocker for
this patchset.

Thanks,
Naoya Horiguchi