Re: [Patch v4 1/6] KVM: Support for guest page hinting

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Wed, 15 Nov 2017 22:06:10 +0200

On Wed, Nov 15, 2017 at 02:38:20PM -0500, Nitesh Narayan Lal wrote:
> Hi Michael,
> 
> 
> On 11/13/2017 12:59 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2017 at 04:30:08PM -0400, nilal@xxxxxxxxxx wrote:
> >> From: Nitesh Narayan Lal <nilal@xxxxxxxxxx>
> >>
> >> This patch includes the following:
> >> 1. Basic skeleton for the support
> >> 2. Enablement of x86 platform to use the same
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nilal@xxxxxxxxxx>
> >> ---
> >>  arch/x86/Kbuild         |  2 +-
> >>  arch/x86/kvm/Makefile   |  2 ++
> >>  include/linux/gfp.h     |  7 +++++++
> >>  virt/kvm/Kconfig        |  4 ++++
> >>  virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  5 files changed, 60 insertions(+), 1 deletion(-)
> >>  create mode 100644 virt/kvm/page_hinting.c
> >>
> >> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
> >> index 0038a2d..7d39d7d 100644
> >> --- a/arch/x86/Kbuild
> >> +++ b/arch/x86/Kbuild
> >> @@ -2,7 +2,7 @@ obj-y += entry/
> >>  
> >>  obj-$(CONFIG_PERF_EVENTS) += events/
> >>  
> >> -obj-$(CONFIG_KVM) += kvm/
> >> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/
> >>  
> >>  # Xen paravirtualization support
> >>  obj-$(CONFIG_XEN) += xen/
> >> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> >> index 09d4b17..d8a3800 100644
> >> --- a/arch/x86/kvm/Makefile
> >> +++ b/arch/x86/kvm/Makefile
> >> @@ -15,6 +15,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
> >>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> >>  			   hyperv.o page_track.o debugfs.o
> >>  
> >> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING)	+= $(KVM)/page_hinting.o
> >> +
> >>  kvm-intel-y		+= vmx.o pmu_intel.o
> >>  kvm-amd-y		+= svm.o pmu_amd.o
> >>  
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index f780718..a74371f 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >>  	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >>  }
> >>  
> >> +#ifdef	CONFIG_KVM_FREE_PAGE_HINTING
> >> +#define HAVE_ARCH_ALLOC_PAGE
> >> +#define HAVE_ARCH_FREE_PAGE
> >> +void arch_free_page(struct page *page, int order);
> >> +void arch_alloc_page(struct page *page, int order);
> >> +#endif
> >> +
> >>  #ifndef HAVE_ARCH_FREE_PAGE
> >>  static inline void arch_free_page(struct page *page, int order) { }
> >>  #endif
> >
> > So the idea is to send pages to host in arch_free_page?
> >
> > However, I notice:
> >
> >         arch_free_page(page, order);
> >         kernel_poison_pages(page, 1 << order, 0);
> >
> > Thus pages are immediately written to after they are freed.
> >
> >
> >
> >
> >
> >> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> >> index b0cc1a3..936c71d 100644
> >> --- a/virt/kvm/Kconfig
> >> +++ b/virt/kvm/Kconfig
> >> @@ -50,3 +50,7 @@ config KVM_COMPAT
> >>  
> >>  config HAVE_KVM_IRQ_BYPASS
> >>         bool
> >> +
> >> +config KVM_FREE_PAGE_HINTING
> >> +       def_bool y
> >> +       depends on KVM
> >> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c
> >> new file mode 100644
> >> index 0000000..39d2b1d
> >> --- /dev/null
> >> +++ b/virt/kvm/page_hinting.c
> >> @@ -0,0 +1,46 @@
> >> +#include <linux/gfp.h>
> >> +#include <linux/mm.h>
> >> +#include <linux/page_ref.h>
> >> +#include <linux/kvm_host.h>
> >> +#include <linux/sort.h>
> >> +
> >> +#include <trace/events/kvm.h>
> >> +
> >> +#define MAX_FGPT_ENTRIES	1000
> >> +#define HYPERLIST_THRESHOLD	500
> >> +/*
> >> + * struct kvm_free_pages - Tracks the pages which are freed by the guest.
> >> + * @pfn	- page frame number for the page which is to be freed
> >> + * @pages - number of pages which are supposed to be freed.
> >> + * A global array object is used to hold the list of pfn and number of pages
> >> + * which are freed by the guest. This list may also have fragmentated pages so
> >> + * defragmentation is a must prior to the hypercall.
> >> + */
> >> +struct kvm_free_pages {
> >> +	unsigned long pfn;
> >> +	unsigned int pages;
> >> +};
> >> +
> >> +/*
> >> + * hypervisor_pages - It is a dummy structure passed with the hypercall.
> >> + * @pfn - page frame number for the page which is to be freed.
> >> + * @pages - number of pages which are supposed to be freed.
> >> + * A global array object is used to to hold the list of pfn and pages and is
> >> + * passed as part of the hypercall.
> >> + */
> >> +struct hypervisor_pages {
> >> +	unsigned long pfn;
> >> +	unsigned int pages;
> >> +};
> >> +
> >> +DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt);
> >> +DEFINE_PER_CPU(int, kvm_pt_idx);
> >> +struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES];
> >> +
> >> +void arch_alloc_page(struct page *page, int order)
> >> +{
> >> +}
> >> +
> >> +void arch_free_page(struct page *page, int order)
> >> +{
> >> +}
> >> -- 
> >> 2.9.4
> Thanks Michael for pointing out this bug.
> The issue here is when both 'Guest Page Hinting' and 'Page Poisoning'
> are enabled as it may result in guest generating memory corruption
> errors when the hypervisor maps a different physical memory to the guest
> after freeing the earlier mapped memory.
> 
> Page Poisoning is a feature in which the page is filled with a specific
> pattern of (0x00 or 0xaa) after arch_free_page and the same is verified
> before arch_alloc_page to prevent following issues:
> 
>     *information leak from the freed data
>     *use after free bugs
>     *memory corruption
> 
> -Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO
> 
> Guest page hinting is a support for handing freed memory between the
> guest and the host. It enables the guests with DAX (no page cache) to
> rapidly free and reclaims memory to and from the host respectively. To
> achieve this task it uses two lists one CPU-local and other CPU-global. 
> Whenever a page is freed it is added to the respective CPU-local list
> through arch_free_page() until it is full. Once the list is full a
> seqlock is taken to prevent any further page allocations and the per
> CPU-local list is traversed and only those pages which are not
> reallocated by the guest are added to the CPU-global list. This global
> list is then sent to the hypervisor/host.
> 
> After freeing the pages in the global list following things may happen:
> 
>     *Hypervisor reallocates the freed memory back to the guest if required
>     *Hypervisor frees the memory and maps a different physical memory at
> its place
> 
> In order to prevent any information leak hypervisor before allocating
> memory to the guest fills it with zeroes.
> The issue arises when the pattern used for Page Poisoning is 0xaa while
> the newly allocated page received from the hypervisor by the guest is
> filled with the pattern 0x00. This will result in memory corruption errors.
> As per the current implementation in order to enable page poisoning a
> boot time parameter (page_poison) is set. Hence  there could be two
> following ways by which we could resolve the above-stated issue:
> 
> 1. We can disable page poisoning if guest page hinting is enabled. The
> only code change required in this case will be to check if
> CONFIG_PAGE_POISONING is enabled, if so disable it so that poisoning is
> not checked once we receive newly allocated memory from the hypervisor.
> 
> 2. We can have a flag in the page structure using which we can control
> poisoning on a per page basis. Using this we can disable the poisoning
> for only those pages which are sent to the hypervisor so that posioning
> is not checked on the newly allocated memory which is received by the
> guest form the hypervisor. Though in order to incorporate this I may
> have to make changes in page_poison.c, mm_types.h and other related files.
> 
> I am looking forward to the comments and suggestions.
> 
> [1] http://www.spinics.net/lists/kvm/msg153666.html
> [2] https://www.spinics.net/lists/kvm/msg158054.html

Let's not discuss this off-list. Please repost with Cc a
mailing list so the discussion is archived.

Thanks!

> -- 
> Regards
> Nitesh
>