> On Nov 13, 2019, at 10:46 AM, Greg Kurz <groug@xxxxxxxx> wrote:
>
> The EQ page is allocated by the guest and then passed to the hypervisor
> with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
> before handing it over to the HW. This reference is dropped either when
> the guest issues the H_INT_RESET hcall or when the KVM device is released.
> But the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
> either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
> In both cases the existing EQ page reference is leaked because we simply
> overwrite it in the XIVE queue structure without calling put_page().
>
> This is especially visible when the guest memory is backed with huge pages:
> start a VM up to the guest userspace, either reboot it or unplug a vCPU,
> then quit QEMU. The leak is observed by comparing the value of HugePages_Free
> in /proc/meminfo before and after the VM is run.
>
> Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
> platform level. This isn't the case right now because the various XIVE
> drivers have different allocation needs. It could maybe be worth introducing
> hooks for this purpose instead of exposing XIVE internals to the drivers,
> but that is certainly a lot of work to be done later.
>
> In the meantime, for easier backport, fix both the vCPU unplug and guest
> reboot leaks by introducing a wrapper around xive_native_configure_queue()
> that does the necessary cleanup.
>
> Reported-by: Satheesh Rajendran <sathnaga@xxxxxxxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # v5.2
> Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
> Signed-off-by: Cédric Le Goater <clg@xxxxxxxx>
> Signed-off-by: Greg Kurz <groug@xxxxxxxx>

Tested-by: Lijun Pan <ljp@xxxxxxxxxxxxx>

> ---
> v2: use wrapper as suggested by Cedric
> ---
>  arch/powerpc/kvm/book3s_xive_native.c | 31 ++++++++++++++++++++++---------
>  1 file changed, 22 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 34bd123fa024..0e1fc5a16729 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -50,6 +50,24 @@ static void kvmppc_xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
>  	}
>  }
>
> +static int kvmppc_xive_native_configure_queue(u32 vp_id, struct xive_q *q,
> +					      u8 prio, __be32 *qpage,
> +					      u32 order, bool can_escalate)
> +{
> +	int rc;
> +	__be32 *qpage_prev = q->qpage;
> +
> +	rc = xive_native_configure_queue(vp_id, q, prio, qpage, order,
> +					 can_escalate);
> +	if (rc)
> +		return rc;
> +
> +	if (qpage_prev)
> +		put_page(virt_to_page(qpage_prev));
> +
> +	return rc;
> +}
> +
>  void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> @@ -575,19 +593,14 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
>  		q->guest_qaddr = 0;
>  		q->guest_qshift = 0;
>
> -		rc = xive_native_configure_queue(xc->vp_id, q, priority,
> -						 NULL, 0, true);
> +		rc = kvmppc_xive_native_configure_queue(xc->vp_id, q, priority,
> +							NULL, 0, true);
>  		if (rc) {
>  			pr_err("Failed to reset queue %d for VCPU %d: %d\n",
>  			       priority, xc->server_num, rc);
>  			return rc;
>  		}
>
> -		if (q->qpage) {
> -			put_page(virt_to_page(q->qpage));
> -			q->qpage = NULL;
> -		}
> -
>  		return 0;
>  	}
>
> @@ -646,8 +659,8 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
>  	 * OPAL level because the use of END ESBs is not supported by
>  	 * Linux.
>  	 */
> -	rc = xive_native_configure_queue(xc->vp_id, q, priority,
> -					 (__be32 *) qaddr, kvm_eq.qshift, true);
> +	rc = kvmppc_xive_native_configure_queue(xc->vp_id, q, priority,
> +						(__be32 *) qaddr, kvm_eq.qshift, true);
>  	if (rc) {
>  		pr_err("Failed to configure queue %d for VCPU %d: %d\n",
>  		       priority, xc->server_num, rc);
>
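
Not part of the patch above, just a minimal self-contained sketch of the observation
step described in the changelog: it prints HugePages_Free from /proc/meminfo, so
running it before the VM is started and again after QEMU exits makes the leak show
up as a lower value afterwards (assuming guest memory is backed by huge pages as
described). The file name and build command below are arbitrary.

/* hugepages_free.c - print HugePages_Free from /proc/meminfo.
 * Build with: gcc -Wall -o hugepages_free hugepages_free.c
 * Run before starting the VM and again after quitting QEMU; a lower
 * count afterwards indicates leaked EQ pages.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[256];
	unsigned long free_pages;

	if (!f) {
		perror("fopen /proc/meminfo");
		return EXIT_FAILURE;
	}

	while (fgets(line, sizeof(line), f)) {
		/* Lines look like "HugePages_Free:     512" */
		if (sscanf(line, "HugePages_Free: %lu", &free_pages) == 1) {
			printf("HugePages_Free: %lu\n", free_pages);
			fclose(f);
			return EXIT_SUCCESS;
		}
	}

	fclose(f);
	fprintf(stderr, "HugePages_Free not found in /proc/meminfo\n");
	return EXIT_FAILURE;
}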