On Wed, Nov 15, 2017 at 02:38:20PM -0500, Nitesh Narayan Lal wrote: > Hi Michael, > > > On 11/13/2017 12:59 PM, Michael S. Tsirkin wrote: > > On Fri, Nov 03, 2017 at 04:30:08PM -0400, nilal@xxxxxxxxxx wrote: > >> From: Nitesh Narayan Lal <nilal@xxxxxxxxxx> > >> > >> This patch includes the following: > >> 1. Basic skeleton for the support > >> 2. Enablement of x86 platform to use the same > >> > >> Signed-off-by: Nitesh Narayan Lal <nilal@xxxxxxxxxx> > >> --- > >> arch/x86/Kbuild | 2 +- > >> arch/x86/kvm/Makefile | 2 ++ > >> include/linux/gfp.h | 7 +++++++ > >> virt/kvm/Kconfig | 4 ++++ > >> virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > >> 5 files changed, 60 insertions(+), 1 deletion(-) > >> create mode 100644 virt/kvm/page_hinting.c > >> > >> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild > >> index 0038a2d..7d39d7d 100644 > >> --- a/arch/x86/Kbuild > >> +++ b/arch/x86/Kbuild > >> @@ -2,7 +2,7 @@ obj-y += entry/ > >> > >> obj-$(CONFIG_PERF_EVENTS) += events/ > >> > >> -obj-$(CONFIG_KVM) += kvm/ > >> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/ > >> > >> # Xen paravirtualization support > >> obj-$(CONFIG_XEN) += xen/ > >> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile > >> index 09d4b17..d8a3800 100644 > >> --- a/arch/x86/kvm/Makefile > >> +++ b/arch/x86/kvm/Makefile > >> @@ -15,6 +15,8 @@ kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ > >> i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \ > >> hyperv.o page_track.o debugfs.o > >> > >> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING) += $(KVM)/page_hinting.o > >> + > >> kvm-intel-y += vmx.o pmu_intel.o > >> kvm-amd-y += svm.o pmu_amd.o > >> > >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >> index f780718..a74371f 100644 > >> --- a/include/linux/gfp.h > >> +++ b/include/linux/gfp.h > >> @@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > >> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > >> } > >> > >> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING > >> +#define HAVE_ARCH_ALLOC_PAGE > >> +#define HAVE_ARCH_FREE_PAGE > >> +void arch_free_page(struct page *page, int order); > >> +void arch_alloc_page(struct page *page, int order); > >> +#endif > >> + > >> #ifndef HAVE_ARCH_FREE_PAGE > >> static inline void arch_free_page(struct page *page, int order) { } > >> #endif > > > > So the idea is to send pages to host in arch_free_page? > > > > However, I notice: > > > > arch_free_page(page, order); > > kernel_poison_pages(page, 1 << order, 0); > > > > Thus pages are immediately written to after they are freed. > > > > > > > > > > > >> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > >> index b0cc1a3..936c71d 100644 > >> --- a/virt/kvm/Kconfig > >> +++ b/virt/kvm/Kconfig > >> @@ -50,3 +50,7 @@ config KVM_COMPAT > >> > >> config HAVE_KVM_IRQ_BYPASS > >> bool > >> + > >> +config KVM_FREE_PAGE_HINTING > >> + def_bool y > >> + depends on KVM > >> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c > >> new file mode 100644 > >> index 0000000..39d2b1d > >> --- /dev/null > >> +++ b/virt/kvm/page_hinting.c > >> @@ -0,0 +1,46 @@ > >> +#include <linux/gfp.h> > >> +#include <linux/mm.h> > >> +#include <linux/page_ref.h> > >> +#include <linux/kvm_host.h> > >> +#include <linux/sort.h> > >> + > >> +#include <trace/events/kvm.h> > >> + > >> +#define MAX_FGPT_ENTRIES 1000 > >> +#define HYPERLIST_THRESHOLD 500 > >> +/* > >> + * struct kvm_free_pages - Tracks the pages which are freed by the guest. > >> + * @pfn - page frame number for the page which is to be freed > >> + * @pages - number of pages which are supposed to be freed. > >> + * A global array object is used to hold the list of pfn and number of pages > >> + * which are freed by the guest. This list may also have fragmentated pages so > >> + * defragmentation is a must prior to the hypercall. > >> + */ > >> +struct kvm_free_pages { > >> + unsigned long pfn; > >> + unsigned int pages; > >> +}; > >> + > >> +/* > >> + * hypervisor_pages - It is a dummy structure passed with the hypercall. > >> + * @pfn - page frame number for the page which is to be freed. > >> + * @pages - number of pages which are supposed to be freed. > >> + * A global array object is used to to hold the list of pfn and pages and is > >> + * passed as part of the hypercall. > >> + */ > >> +struct hypervisor_pages { > >> + unsigned long pfn; > >> + unsigned int pages; > >> +}; > >> + > >> +DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt); > >> +DEFINE_PER_CPU(int, kvm_pt_idx); > >> +struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES]; > >> + > >> +void arch_alloc_page(struct page *page, int order) > >> +{ > >> +} > >> + > >> +void arch_free_page(struct page *page, int order) > >> +{ > >> +} > >> -- > >> 2.9.4 > Thanks Michael for pointing out this bug. > The issue here is when both 'Guest Page Hinting' and 'Page Poisoning' > are enabled as it may result in guest generating memory corruption > errors when the hypervisor maps a different physical memory to the guest > after freeing the earlier mapped memory. > > Page Poisoning is a feature in which the page is filled with a specific > pattern of (0x00 or 0xaa) after arch_free_page and the same is verified > before arch_alloc_page to prevent following issues: > > *information leak from the freed data > *use after free bugs > *memory corruption > > -Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO > > Guest page hinting is a support for handing freed memory between the > guest and the host. It enables the guests with DAX (no page cache) to > rapidly free and reclaims memory to and from the host respectively. To > achieve this task it uses two lists one CPU-local and other CPU-global. > Whenever a page is freed it is added to the respective CPU-local list > through arch_free_page() until it is full. Once the list is full a > seqlock is taken to prevent any further page allocations and the per > CPU-local list is traversed and only those pages which are not > reallocated by the guest are added to the CPU-global list. This global > list is then sent to the hypervisor/host. > > After freeing the pages in the global list following things may happen: > > *Hypervisor reallocates the freed memory back to the guest if required > *Hypervisor frees the memory and maps a different physical memory at > its place > > In order to prevent any information leak hypervisor before allocating > memory to the guest fills it with zeroes. > The issue arises when the pattern used for Page Poisoning is 0xaa while > the newly allocated page received from the hypervisor by the guest is > filled with the pattern 0x00. This will result in memory corruption errors. > As per the current implementation in order to enable page poisoning a > boot time parameter (page_poison) is set. Hence there could be two > following ways by which we could resolve the above-stated issue: > > 1. We can disable page poisoning if guest page hinting is enabled. The > only code change required in this case will be to check if > CONFIG_PAGE_POISONING is enabled, if so disable it so that poisoning is > not checked once we receive newly allocated memory from the hypervisor. > > 2. We can have a flag in the page structure using which we can control > poisoning on a per page basis. Using this we can disable the poisoning > for only those pages which are sent to the hypervisor so that posioning > is not checked on the newly allocated memory which is received by the > guest form the hypervisor. Though in order to incorporate this I may > have to make changes in page_poison.c, mm_types.h and other related files. > > I am looking forward to the comments and suggestions. > > [1] http://www.spinics.net/lists/kvm/msg153666.html > [2] https://www.spinics.net/lists/kvm/msg158054.html Let's not discuss this off-list. Please repost with Cc a mailing list so the discussion is archived. Thanks! > -- > Regards > Nitesh >