Hi Michael, On 11/13/2017 12:59 PM, Michael S. Tsirkin wrote: > On Fri, Nov 03, 2017 at 04:30:08PM -0400, nilal@xxxxxxxxxx wrote: >> From: Nitesh Narayan Lal <nilal@xxxxxxxxxx> >> >> This patch includes the following: >> 1. Basic skeleton for the support >> 2. Enablement of x86 platform to use the same >> >> Signed-off-by: Nitesh Narayan Lal <nilal@xxxxxxxxxx> >> --- >> arch/x86/Kbuild | 2 +- >> arch/x86/kvm/Makefile | 2 ++ >> include/linux/gfp.h | 7 +++++++ >> virt/kvm/Kconfig | 4 ++++ >> virt/kvm/page_hinting.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ >> 5 files changed, 60 insertions(+), 1 deletion(-) >> create mode 100644 virt/kvm/page_hinting.c >> >> diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild >> index 0038a2d..7d39d7d 100644 >> --- a/arch/x86/Kbuild >> +++ b/arch/x86/Kbuild >> @@ -2,7 +2,7 @@ obj-y += entry/ >> >> obj-$(CONFIG_PERF_EVENTS) += events/ >> >> -obj-$(CONFIG_KVM) += kvm/ >> +obj-$(subst m,y,$(CONFIG_KVM)) += kvm/ >> >> # Xen paravirtualization support >> obj-$(CONFIG_XEN) += xen/ >> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile >> index 09d4b17..d8a3800 100644 >> --- a/arch/x86/kvm/Makefile >> +++ b/arch/x86/kvm/Makefile >> @@ -15,6 +15,8 @@ kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ >> i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \ >> hyperv.o page_track.o debugfs.o >> >> +obj-$(CONFIG_KVM_FREE_PAGE_HINTING) += $(KVM)/page_hinting.o >> + >> kvm-intel-y += vmx.o pmu_intel.o >> kvm-amd-y += svm.o pmu_amd.o >> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h >> index f780718..a74371f 100644 >> --- a/include/linux/gfp.h >> +++ b/include/linux/gfp.h >> @@ -452,6 +452,13 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags) >> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); >> } >> >> +#ifdef CONFIG_KVM_FREE_PAGE_HINTING >> +#define HAVE_ARCH_ALLOC_PAGE >> +#define HAVE_ARCH_FREE_PAGE >> +void arch_free_page(struct page *page, int order); >> +void arch_alloc_page(struct page *page, int order); >> +#endif >> + >> #ifndef HAVE_ARCH_FREE_PAGE >> static inline void arch_free_page(struct page *page, int order) { } >> #endif > > So the idea is to send pages to host in arch_free_page? > > However, I notice: > > arch_free_page(page, order); > kernel_poison_pages(page, 1 << order, 0); > > Thus pages are immediately written to after they are freed. > > > > > >> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig >> index b0cc1a3..936c71d 100644 >> --- a/virt/kvm/Kconfig >> +++ b/virt/kvm/Kconfig >> @@ -50,3 +50,7 @@ config KVM_COMPAT >> >> config HAVE_KVM_IRQ_BYPASS >> bool >> + >> +config KVM_FREE_PAGE_HINTING >> + def_bool y >> + depends on KVM >> diff --git a/virt/kvm/page_hinting.c b/virt/kvm/page_hinting.c >> new file mode 100644 >> index 0000000..39d2b1d >> --- /dev/null >> +++ b/virt/kvm/page_hinting.c >> @@ -0,0 +1,46 @@ >> +#include <linux/gfp.h> >> +#include <linux/mm.h> >> +#include <linux/page_ref.h> >> +#include <linux/kvm_host.h> >> +#include <linux/sort.h> >> + >> +#include <trace/events/kvm.h> >> + >> +#define MAX_FGPT_ENTRIES 1000 >> +#define HYPERLIST_THRESHOLD 500 >> +/* >> + * struct kvm_free_pages - Tracks the pages which are freed by the guest. >> + * @pfn - page frame number for the page which is to be freed >> + * @pages - number of pages which are supposed to be freed. >> + * A global array object is used to hold the list of pfn and number of pages >> + * which are freed by the guest. This list may also have fragmentated pages so >> + * defragmentation is a must prior to the hypercall. >> + */ >> +struct kvm_free_pages { >> + unsigned long pfn; >> + unsigned int pages; >> +}; >> + >> +/* >> + * hypervisor_pages - It is a dummy structure passed with the hypercall. >> + * @pfn - page frame number for the page which is to be freed. >> + * @pages - number of pages which are supposed to be freed. >> + * A global array object is used to to hold the list of pfn and pages and is >> + * passed as part of the hypercall. >> + */ >> +struct hypervisor_pages { >> + unsigned long pfn; >> + unsigned int pages; >> +}; >> + >> +DEFINE_PER_CPU(struct kvm_free_pages [MAX_FGPT_ENTRIES], kvm_pt); >> +DEFINE_PER_CPU(int, kvm_pt_idx); >> +struct hypervisor_pages hypervisor_pagelist[MAX_FGPT_ENTRIES]; >> + >> +void arch_alloc_page(struct page *page, int order) >> +{ >> +} >> + >> +void arch_free_page(struct page *page, int order) >> +{ >> +} >> -- >> 2.9.4 Thanks Michael for pointing out this bug. The issue here is when both 'Guest Page Hinting' and 'Page Poisoning' are enabled as it may result in guest generating memory corruption errors when the hypervisor maps a different physical memory to the guest after freeing the earlier mapped memory. Page Poisoning is a feature in which the page is filled with a specific pattern of (0x00 or 0xaa) after arch_free_page and the same is verified before arch_alloc_page to prevent following issues: *information leak from the freed data *use after free bugs *memory corruption -Selection of the pattern depends on the CONFIG_PAGE_POISONING_ZERO Guest page hinting is a support for handing freed memory between the guest and the host. It enables the guests with DAX (no page cache) to rapidly free and reclaims memory to and from the host respectively. To achieve this task it uses two lists one CPU-local and other CPU-global. Whenever a page is freed it is added to the respective CPU-local list through arch_free_page() until it is full. Once the list is full a seqlock is taken to prevent any further page allocations and the per CPU-local list is traversed and only those pages which are not reallocated by the guest are added to the CPU-global list. This global list is then sent to the hypervisor/host. After freeing the pages in the global list following things may happen: *Hypervisor reallocates the freed memory back to the guest if required *Hypervisor frees the memory and maps a different physical memory at its place In order to prevent any information leak hypervisor before allocating memory to the guest fills it with zeroes. The issue arises when the pattern used for Page Poisoning is 0xaa while the newly allocated page received from the hypervisor by the guest is filled with the pattern 0x00. This will result in memory corruption errors. As per the current implementation in order to enable page poisoning a boot time parameter (page_poison) is set. Hence there could be two following ways by which we could resolve the above-stated issue: 1. We can disable page poisoning if guest page hinting is enabled. The only code change required in this case will be to check if CONFIG_PAGE_POISONING is enabled, if so disable it so that poisoning is not checked once we receive newly allocated memory from the hypervisor. 2. We can have a flag in the page structure using which we can control poisoning on a per page basis. Using this we can disable the poisoning for only those pages which are sent to the hypervisor so that posioning is not checked on the newly allocated memory which is received by the guest form the hypervisor. Though in order to incorporate this I may have to make changes in page_poison.c, mm_types.h and other related files. I am looking forward to the comments and suggestions. [1] http://www.spinics.net/lists/kvm/msg153666.html [2] https://www.spinics.net/lists/kvm/msg158054.html -- Regards Nitesh
Attachment:
signature.asc
Description: OpenPGP digital signature