On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> Observed the following oops while stress-testing, using multiple
> secure VMs on a distro kernel. However, this issue theoretically
> exists in 5.5 kernels and later.
>
> This issue occurs when the total number of requested device PFNs
> exceeds the total number of available device PFNs. PFN migration fails
> to allocate a device PFN, which causes migrate_vma_finalize() to
> trigger kvmppc_uvmem_page_free() on a page that is not associated with
> any device PFN. kvmppc_uvmem_page_free() blindly tries to access the
> contents of the private data, which can be NULL, leading to the
> following kernel fault.
>
> --------------------------------------------------------------------------
> Unable to handle kernel paging request for data at address 0x00000011
> Faulting instruction address: 0xc00800000e36e110
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA PowerNV
> ....
> MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
> CR: 24424822 XER: 00000000
> CFAR: c000000000e3d764 DAR: 0000000000000011 DSISR: 40000000 IRQMASK: 0
> GPR00: c00800000e36e0a4 c000001f1d59f610 c00800000e38a400 0000000000000000
> GPR04: c000001fa5000000 fffffffffffffffe ffffffffffffffff c000201fffeaf300
> GPR08: 00000000000001f0 0000000000000000 0000000000000f80 c00800000e373608
> GPR12: c000000000e3d710 c000201fffeaf300 0000000000000001 00007fef87360000
> GPR16: 00007fff97db4410 c000201c3b66a578 ffffffffffffffff 0000000000000000
> GPR20: 0000000119db9ad0 000000000000000a fffffffffffffffc 0000000000000001
> GPR24: c000201c3b660000 c000001f1d59f7a0 c0000000004cffb0 0000000000000001
> GPR28: 0000000000000000 c00a001ff003e000 c00800000e386150 0000000000000f80
> NIP [c00800000e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
> LR [c00800000e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
> Call Trace:
> [c000000000512010] free_devmap_managed_page+0xd0/0x100
> [c0000000003f71d0] put_devmap_managed_page+0xa0/0xc0
> [c0000000004d24bc] migrate_vma_finalize+0x32c/0x410
> [c00800000e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
> [c00800000e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
> [c00800000e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
> [c00800000e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
> [c00800000e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
> [c00800000e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
> [c00800000e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
> [c00800000e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
> [c0000000005451d0] do_vfs_ioctl+0xe0/0xaa0
> [c000000000545d64] sys_ioctl+0xc4/0x160
> [c00000000000b408] system_call+0x5c/0x70
> Instruction dump:
> a12d1174 2f890000 409e0158 a1271172 3929ffff b1271172 7c2004ac 39200000
> 913e0140 39200000 e87d0010 f93d0010 <89230011> e8c30000 e9030008 2f890000
> --------------------------------------------------------------------------
>
> Fix the oops.
>
> Fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
> Signed-off-by: Ram Pai <linuxram@xxxxxxxxxx>
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 2806983..f4002bf 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
>  {
>  	unsigned long pfn = page_to_pfn(page) -
>  			(kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
> -	struct kvmppc_uvmem_page_pvt *pvt;
> +	struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
> +
> +	if (!pvt)
> +		return;
>
>  	spin_lock(&kvmppc_uvmem_bitmap_lock);
>  	bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
>  	spin_unlock(&kvmppc_uvmem_bitmap_lock);
>
> -	pvt = page->zone_device_data;
>  	page->zone_device_data = NULL;
>  	if (pvt->remove_gfn)
>  		kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);

In our case, device
pages that are in use are always associated with a valid pvt member. See
kvmppc_uvmem_get_page(), which returns failure if it runs out of device
PFNs, and that results in proper failure of the page-in calls.

For the case where we run out of device PFNs, migrate_vma_finalize() will
restore the original PTE and will not replace it with a device-private
PTE. Also, kvmppc_uvmem_page_free() (i.e. dev_pagemap_ops.page_free())
is never called for non-device-private pages.

This could be a use-after-free case, possibly arising out of the new
state changes in HV. If so, this fix will only mask the bug and not
address the original problem.

Regards,
Bharata.