On Mon, 2019-02-04 at 17:46 -0800, Nadav Amit wrote:
> > On Feb 4, 2019, at 4:16 PM, Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 4, 2019 at 4:03 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> > > > On Feb 4, 2019, at 3:37 PM, Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
> > > > > > On Feb 4, 2019, at 10:15 AM, Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
> > > > > >
> > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > performing one per 4K page. Using the huge TLB order became the obvious
> > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > order memory on the host.
> > > > > >
> > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > cycles to do so.
> > > > > >
> > > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
> > > > > > ---
> > > > > >  arch/x86/include/asm/page.h | 13 +++++++++++++
> > > > > >  arch/x86/kernel/kvm.c       | 23 +++++++++++++++++++++++
> > > > > >  2 files changed, 36 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > > > > > index 7555b48803a8..4487ad7a3385 100644
> > > > > > --- a/arch/x86/include/asm/page.h
> > > > > > +++ b/arch/x86/include/asm/page.h
> > > > > > @@ -18,6 +18,19 @@
> > > > > >
> > > > > >  struct page;
> > > > > >
> > > > > > +#ifdef CONFIG_KVM_GUEST
> > > > > > +#include <linux/jump_label.h>
> > > > > > +extern struct static_key_false pv_free_page_hint_enabled;
> > > > > > +
> > > > > > +#define HAVE_ARCH_FREE_PAGE
> > > > > > +void __arch_free_page(struct page *page, unsigned int order);
> > > > > > +static inline void arch_free_page(struct page *page, unsigned int order)
> > > > > > +{
> > > > > > +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > > > > +		__arch_free_page(page, order);
> > > > > > +}
> > > > > > +#endif
> > > > >
> > > > > This patch and the following one assume that only KVM should be able to hook
> > > > > to these events. I do not think it is appropriate for __arch_free_page() to
> > > > > effectively mean “kvm_guest_free_page()”.
> > > > >
> > > > > Is it possible to use the paravirt infrastructure for this feature,
> > > > > similarly to other PV features? It is not the best infrastructure, but at least
> > > > > it is hypervisor-neutral.
> > > >
> > > > I could probably tie this into the paravirt infrastructure, but if I
> > > > did so I would probably want to pull the checks for the page order out
> > > > of the KVM specific bits and make it something we handle in the inline.
> > > > Doing that I would probably make it a paravirtual hint that only
> > > > operates at the PMD level.
> > > > That way we wouldn't incur the cost of the
> > > > paravirt infrastructure at the per 4K page level.
> > >
> > > If I understand you correctly, you “complain” that this would affect
> > > performance.
> >
> > It wasn't so much a "complaint" as an "observation". What I was
> > getting at is that if I am going to make it a PV operation I might set
> > a hard limit on it so that it will specifically only apply to huge
> > pages and larger. By doing that I can justify performing the screening
> > based on page order in the inline path and avoid any PV infrastructure
> > overhead unless I have to incur it.
>
> I understood. I guess my use of “double quotes” was lost in translation. ;-)

Yeah, I just figured I would restate it to make sure we were "on the
same page". ;-)

> One more point regarding [2/4] - you may want to consider using madvise_free
> instead of madvise_dontneed to avoid unnecessary EPT violations.

For now I am using MADV_DONTNEED because it reduces the complexity. I
have been working on a proof of concept with MADV_FREE, however we then
have to add some additional checks as MADV_FREE only works with
anonymous memory if I am not mistaken.
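
For reference, the MADV_DONTNEED behavior mentioned above can be seen from
plain userspace. The following is only a minimal sketch and is not part of
this series: probe_after_madvise() is a hypothetical helper name, and it
assumes Linux with a private anonymous mapping. It dirties a buffer, applies
the given advice, and reports the first byte afterwards.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only): map `len` bytes of private
 * anonymous memory, dirty every byte, apply `advice` via madvise(), and
 * return the first byte of the mapping afterwards (0..255).
 * Returns -1 if mmap() or madvise() fails.
 */
int probe_after_madvise(size_t len, int advice)
{
	unsigned char *buf;
	int first;

	assert(len > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Touch the whole range so the pages are populated and dirty. */
	memset(buf, 0xAB, len);

	if (madvise(buf, len, advice) != 0) {
		munmap(buf, len);
		return -1;
	}

	/* After MADV_DONTNEED this read faults in a fresh zero page. */
	first = buf[0];

	munmap(buf, len);
	return first;
}
```

With MADV_DONTNEED the helper returns 0, since the old contents are dropped
and the next access is served from a zero-filled page. With MADV_FREE the
result is not deterministic: the kernel may keep the old data (0xAB) until it
actually reclaims the page, and the call is rejected for mappings that are
not private anonymous memory, which is where the additional checks would come
in.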