On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
> 
> KVM's handling of non-refcounted pfns has two problems:
> 
>  - struct pages without refcounting (e.g. tail pages of non-compound
>    higher order pages) cannot be used at all, as gfn_to_pfn does not
>    provide enough information for callers to handle the refcount.
>  - pfns without struct pages can be accessed without the protection of a
>    mmu notifier. This is unsafe because KVM cannot monitor or control
>    the lifespan of such pfns, so it may continue to access the pfns
>    after they are freed.
> 
> This patch extends the __kvm_follow_pfn API to properly handle these
> cases.
> 
> First, it adds a is_refcounted_page output parameter so that
> callers can tell whether or not a pfn has a struct page that needs to be
> passed to put_page.
> 
> Second, it adds a guarded_by_mmu_notifier parameter
> that is used to avoid returning non-refcounted pages when the caller
> cannot safely use them.
> 
> Since callers need to be updated on a case-by-case basis to pay
> attention to is_refcounted_page, the new behavior of returning
> non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
> parameter. Once all callers have been updated, this parameter should be
> removed.

Small note: since these new parameters are critical for understanding
the patch, maybe it makes sense to re-order their description to match
the order in the struct (or at least to put the output parameter at the
end of the description), and to give each one a separate paragraph, as
I did above.

> 
> The fact that non-refcounted pfns can no longer be accessed without mmu
> notifier protection is a breaking change. Since there is no timeline for
> updating everything in KVM to use mmu notifiers, this change adds an
> opt-in module parameter called allow_unsafe_mappings to allow such
> mappings. Systems which trust userspace not to tear down such unsafe
> mappings while KVM is using them can set this parameter to re-enable the
> legacy behavior.

Do you have a practical example of a VM that can break with this
change? E.g. will a normal VM break? Will a VM with VFIO devices break?
Will a VM with hugepages mapped into it break? Will the trick of
limiting the kernel memory with 'mem=X' and then using the extra
'upper memory' for VMs still work?

> 
> Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
> ---
>  include/linux/kvm_host.h | 21 ++++++++++
>  virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++----------------
>  virt/kvm/pfncache.c      |  1 +
>  3 files changed, 72 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c2e0ddf14dba..2ed08ae1a9be 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Try to create a writable mapping even for a read fault */
>  	bool try_map_writable;
> +	/* Usage of the returned pfn will be guared by a mmu notifier. */
> +	bool guarded_by_mmu_notifier;
> +	/*
> +	 * When false, do not return pfns for non-refcounted struct pages.
> +	 *
> +	 * TODO: This allows callers to use kvm_release_pfn on the pfns
> +	 * returned by gfn_to_pfn without worrying about corrupting the
> +	 * refcounted of non-refcounted pages. Once all callers respect

Typo: refcount.
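While we are at it, let me spell out how I understand a fully converted
caller is supposed to use the new output flag (a hypothetical sketch
based on my reading of the patch, with slot/gfn taken from the caller's
context, not code from the patch itself):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,
		.allow_non_refcounted_struct_page = true,
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	/* ... use the pfn, under mmu notifier protection if needed ... */

	/* Put the page only if a reference was actually taken. */
	if (foll.is_refcounted_page)
		put_page(pfn_to_page(pfn));

If that is the intended contract, stating it explicitly in the commit
message would make the follow-up conversions easier to review.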
> +	 * is_refcounted_page, this flag should be removed.
> +	 */
> +	bool allow_non_refcounted_struct_page;
> 
>  	/* Outputs of __kvm_follow_pfn */
>  	hva_t hva;
>  	bool writable;
> +	/*
> +	 * True if the returned pfn is for a page with a valid refcount. False
> +	 * if the returned pfn has no struct page or if the struct page is not
> +	 * being refcounted (e.g. tail pages of non-compound higher order
> +	 * allocations from IO/PFNMAP mappings).
> 

Aren't all tail pages non-refcounted (e.g. the tail page of a
hugepage)? I haven't researched this topic yet.

> +	 *
> +	 * When this output flag is false, callers should not try to convert
> +	 * the pfn to a struct page.
> +	 */
> +	bool is_refcounted_page;
>  };
> 
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9b33a59c6d65..235c5cb3fdac 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
>  module_param(halt_poll_ns_shrink, uint, 0644);
>  EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
> 
> +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> +static bool allow_unsafe_mappings;
> +module_param(allow_unsafe_mappings, bool, 0444);
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>  	return rc == -EHWPOISON;
>  }
> 
> +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> +					   struct page *page)
> +{
> +	kvm_pfn_t pfn = page_to_pfn(page);
> +
> +	foll->is_refcounted_page = true;
> +	return pfn;
> +}

Just a matter of taste, but to me this function looks confusing. IMHO,
just duplicating these two lines of code is better. However, if you
prefer it this way, I won't argue over this.

> +
>  /*
>   * The fast path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned. It's also the
> @@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  		return false;
> 
>  	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> -		*pfn = page_to_pfn(page[0]);
> +		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);

Yep, here just 'foll->is_refcounted_page = true;' looks more readable
to me.

>  		foll->writable = true;
>  		return true;
>  	}
> @@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  			page = wpage;
>  		}
>  	}
> -	*pfn = page_to_pfn(page);
> +	*pfn = kvm_follow_refcounted_pfn(foll, page);

Same here, and probably in other places.
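In other words, I'd rather see the fast path spelled out like this
(untested, just to illustrate what I mean):

	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
		*pfn = page_to_pfn(page[0]);
		foll->is_refcounted_page = true;
		foll->writable = true;
		return true;
	}

It is one extra line per call site, but the side effect on
foll->is_refcounted_page is visible at a glance.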
>  	return npages;
>  }
> 
> @@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>  	return true;
>  }
> 
> -static int kvm_try_get_pfn(kvm_pfn_t pfn)
> -{
> -	struct page *page = kvm_pfn_to_refcounted_page(pfn);
> -
> -	if (!page)
> -		return 1;
> -
> -	return get_page_unless_zero(page);
> -}
> -
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
>  {
> @@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  	pte_t pte;
>  	spinlock_t *ptl;
>  	bool write_fault = foll->flags & FOLL_WRITE;
> +	struct page *page;
>  	int r;
> 
>  	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> @@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> 
>  	pte = ptep_get(ptep);
> 
> +	foll->writable = pte_write(pte);
> +	pfn = pte_pfn(pte);
> +
> +	page = kvm_pfn_to_refcounted_page(pfn);
> +
>  	if (write_fault && !pte_write(pte)) {
>  		pfn = KVM_PFN_ERR_RO_FAULT;
>  		goto out;
>  	}
> 
> -	foll->writable = pte_write(pte);
> -	pfn = pte_pfn(pte);
> +	if (!page)
> +		goto out;
> 
> -	/*
> -	 * Get a reference here because callers of *hva_to_pfn* and
> -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> -	 * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
> -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> -	 * simply do nothing for reserved pfns.
> -	 *
> -	 * Whoever called remap_pfn_range is also going to call e.g.
> -	 * unmap_mapping_range before the underlying pages are freed,
> -	 * causing a call to our MMU notifier.
> -	 *
> -	 * Certain IO or PFNMAP mappings can be backed with valid
> -	 * struct pages, but be allocated without refcounting e.g.,
> -	 * tail pages of non-compound higher order allocations, which
> -	 * would then underflow the refcount when the caller does the
> -	 * required put_page. Don't allow those pages here.
> -	 */

Why is this comment removed? As far as I can see, the code still grabs
a reference to the page.

> -	if (!kvm_try_get_pfn(pfn))
> -		r = -EFAULT;
> +	if (get_page_unless_zero(page))
> +		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);

Once again, the kvm_follow_refcounted_pfn usage is confusing IMHO: it
sets 'foll->is_refcounted_page', and yet someone could think that it's
only there for the WARN_ON_ONCE.

IMHO this would read better:

	if (get_page_unless_zero(page))
		foll->is_refcounted_page = true;

	WARN_ON_ONCE(page_to_pfn(page) != pfn);

Note that I moved the warn out of the 'get_page_unless_zero' condition
because I think that this condition should be true for non-refcounted
pages as well.

Also, I don't understand why the result of 'get_page_unless_zero(page)'
is ignored. As I understand it, it will increase the refcount of a page
unless it is zero. If the refcount of a refcounted page is 0, isn't
that a bug? The page was returned from kvm_pfn_to_refcounted_page,
which is supposed to return only pages that are refcounted.

I might not understand something in regard to the
'get_page_unless_zero(page)' usage, in both the old and the new code.
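For reference, this is my mental model of get_page_unless_zero (a
simplified sketch of the real helper from include/linux/mm.h):

	/*
	 * Take one reference, unless the refcount is already zero
	 * (i.e. the page is in the middle of being freed).
	 * Returns true if the reference was taken.
	 */
	static inline bool get_page_unless_zero(struct page *page)
	{
		return page_ref_add_unless(page, 1, 0);
	}

So a false return here would mean that we raced with the page being
freed, and silently ignoring that does not look right to me.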
> 
>  out:
>  	pte_unmap_unlock(ptep, ptl);
> -	*p_pfn = pfn;
> +
> +	/*
> +	 * TODO: Remove the first branch once all callers have been
> +	 * taught to play nice with non-refcounted struct pages.
> +	 */
> +	if (page && !foll->is_refcounted_page &&
> +	    !foll->allow_non_refcounted_struct_page) {
> +		r = -EFAULT;
> +	} else if (!foll->is_refcounted_page &&
> +		   !foll->guarded_by_mmu_notifier &&
> +		   !allow_unsafe_mappings) {
> +		r = -EFAULT;
> +	} else {
> +		*p_pfn = pfn;
> +	}
> 
>  	return r;
>  }
> @@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
>  	foll->writable = false;
> +	foll->is_refcounted_page = false;
> +
>  	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
>  				      foll->flags & FOLL_WRITE);
> 
> @@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>  		.flags = 0,
>  		.atomic = atomic,
>  		.try_map_writable = !!writable,
> +		.allow_non_refcounted_struct_page = false,
>  	};
> 
>  	if (write_fault)
> @@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		.gfn = gfn,
>  		.flags = write_fault ? FOLL_WRITE : 0,
>  		.try_map_writable = !!writable,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  	pfn = __kvm_follow_pfn(&foll);
>  	if (writable)
> @@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  		.slot = slot,
>  		.gfn = gfn,
>  		.flags = FOLL_WRITE,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> @@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
>  		.gfn = gfn,
>  		.flags = FOLL_WRITE,
>  		.atomic = true,
> +		/*
> +		 * Setting atomic means __kvm_follow_pfn will never make it
> +		 * to hva_to_pfn_remapped, so this is vacuously true.
> +		 */
> +		.allow_non_refcounted_struct_page = true,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 86cd40acad11..6bbf972c11f8 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  		.gfn = gpa_to_gfn(gpc->gpa),
>  		.flags = FOLL_WRITE,
>  		.hva = gpc->uhva,
> +		.allow_non_refcounted_struct_page = false,
>  	};
> 
>  	lockdep_assert_held(&gpc->refresh_lock);

Best regards,
	Maxim Levitsky