On Thu, Jul 25, 2019 at 05:56:48PM -0700, Ralph Campbell wrote:
> hmm_range_fault() calls find_vma() and walk_page_range() in a loop.
> This is unnecessary duplication since walk_page_range() calls find_vma()
> in a loop already.
>
> Simplify hmm_range_fault() by defining a walk_test() callback function
> to filter unhandled vmas.
>
> Signed-off-by: Ralph Campbell <rcampbell@xxxxxxxxxx>
> Cc: "Jérôme Glisse" <jglisse@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> ---
>  mm/hmm.c | 130 ++++++++++++++++++++++++------------------------------
>  1 file changed, 57 insertions(+), 73 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 1bc014cddd78..838cd1d50497 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -840,13 +840,44 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>  #endif
>  }
>
> -static void hmm_pfns_clear(struct hmm_range *range,
> -			   uint64_t *pfns,
> -			   unsigned long addr,
> -			   unsigned long end)
> +static int hmm_vma_walk_test(unsigned long start,
> +			     unsigned long end,
> +			     struct mm_walk *walk)
>  {
> -	for (; addr < end; addr += PAGE_SIZE, pfns++)
> -		*pfns = range->values[HMM_PFN_NONE];
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct hmm_range *range = hmm_vma_walk->range;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	/* If range is no longer valid, force retry. */
> +	if (!range->valid)
> +		return -EBUSY;
> +
> +	/*
> +	 * Skip vma ranges that don't have struct page backing them or
> +	 * map I/O devices directly.
> +	 * TODO: handle peer-to-peer device mappings.
> +	 */
> +	if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
> +		return -EFAULT;
> +
> +	if (is_vm_hugetlb_page(vma)) {
> +		if (huge_page_shift(hstate_vma(vma)) != range->page_shift &&
> +		    range->page_shift != PAGE_SHIFT)
> +			return -EINVAL;
> +	} else {
> +		if (range->page_shift != PAGE_SHIFT)
> +			return -EINVAL;
> +	}
> +
> +	/*
> +	 * If vma does not allow read access, then assume that it does not
> +	 * allow write access, either. HMM does not support architectures
> +	 * that allow write without read.
> +	 */
> +	if (!(vma->vm_flags & VM_READ))
> +		return -EPERM;
> +
> +	return 0;
>  }
>
>  /*
> @@ -965,82 +996,35 @@ EXPORT_SYMBOL(hmm_range_unregister);
>   */
>  long hmm_range_fault(struct hmm_range *range, unsigned int flags)
>  {
> -	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
> -	unsigned long start = range->start, end;
> -	struct hmm_vma_walk hmm_vma_walk;
> +	unsigned long start = range->start;
> +	struct hmm_vma_walk hmm_vma_walk = {};
>  	struct hmm *hmm = range->hmm;
> -	struct vm_area_struct *vma;
> -	struct mm_walk mm_walk;
> +	struct mm_walk mm_walk = {};
>  	int ret;
>
>  	lockdep_assert_held(&hmm->mm->mmap_sem);
>
> -	do {
> -		/* If range is no longer valid force retry. */
> -		if (!range->valid)
> -			return -EBUSY;
> +	hmm_vma_walk.range = range;
> +	hmm_vma_walk.last = start;
> +	hmm_vma_walk.flags = flags;
> +	mm_walk.private = &hmm_vma_walk;
>
> -		vma = find_vma(hmm->mm, start);
> -		if (vma == NULL || (vma->vm_flags & device_vma))
> -			return -EFAULT;

It is hard to tell what is confused/wrong and what is deliberate in this
code...

Currently hmm_range_fault() invokes walk_page_range() on a VMA by VMA
basis, and the above prevents some cases of walk->vma becoming NULL, but
not all - for instance it doesn't check for start < vma->vm_start.
However, checking if it can actually tolerate walk->vma == NULL, it
looks like no:

 walk_page_range
   find_vma == NULL || start < vm_start -> walk->vma == NULL
    __walk_page_range
     walk_pgd_range
      pte_hole / hmm_vma_walk_hole
       hmm_vma_walk_hole_
        hmm_vma_do_fault
         handle_mm_fault(walk->vma, addr, flags)
            vma->vm_mm <-- OOPS

Which kind of suggests the find_vma() above was about preventing
walk->vma == NULL? Does something else tricky prevent this?

This patch also changes behavior so that missing VMAs don't always
trigger EFAULT (which is a good thing, but it needs to be in the commit
message).

I strongly believe this is the correct direction to go in, and the fact
that this function returns EFAULT if there is no VMA/incompatible VMA is
actually a semantic bug we need to fix before it is a usable API.

Ie consider the user does something like:

  ptr = mmap(0, PAGE_SIZE ..)
  mr = ib_reg_mr(ptr - PAGE_SIZE, ptr + 3*PAGE_SIZE, IBV_ACCESS_ON_DEMAND)

Then in the kernel I want to do hmm_range_fault(HMM_FAULT_SNAPSHOT)
across the MR VA and get a pfns array that says PAGE 0 is FAULT, PAGE 1
is R/W, PAGE 2 is FAULT.

Instead the entire call fails because there is no VMA at the starting
offset, or the VMA had the wrong flags, or something.

What it should do is populate the result with FAULT for the gap part of
the VA range and continue to the next VMA.

The same comment applies to the implementation of the walker test
function: it should return 1 to skip the VMA and fill the pfns with
FAULT when there is a mismatched VMA, not fail entirely.

Perhaps there was some thought that the fault version should fail to
tell the page fault handler there is nothing to DMA, but even that is
not entirely desirable. I'd like to have 'fault around' semantics: if we
are going to do all the work of faulting a few PTEs, lets do a chunk. I
only care if the critical page(s) triggering the fault couldn't be
faulted in; the others can remain as pfn FAULT.

To proceed with this patch we need to confirm/deny the above trace.
I think it probably can be fixed easily (as another patch) by checking
for walk->vma == NULL in the right places.

I really would like to see a test for this function too :( It has lots
and lots of edge cases that need to be comprehensively explored before
we can call this working..

Jason