On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > >
> > > > > > > Maybe a GFP flag to set the page reference count to zero or
> > > > > > > something like this?
> > > > > >
> > > > > > Hm yeah that might work. I'm not sure what it will all break though?
> > > > > > And we'd need to make sure that underflowing the page refcount dies
> > > > > > in a backtrace.
> > > > >
> > > > > Mucking with the refcount like this to protect against crazy out of
> > > > > tree drivers seems horrible..
> > > >
> > > > Well, not only out of tree drivers. The in-tree KVM got that horribly
> > > > wrong as well, those were the latest guys complaining about it.
> > >
> > > kvm was taking refs on special PTEs? That seems really unlikely?
> >
> > Well then look at this code here:
> >
> > commit add6a0cd1c5ba51b201e1361b05a5df817083618
> > Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Date:   Tue Jun 7 17:51:18 2016 +0200
> >
> >     KVM: MMU: try to fix up page faults before giving up
> >
> >     The vGPU folks would like to trap the first access to a BAR by setting
> >     vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault
> >     handler then can use remap_pfn_range to place some non-reserved pages
> >     in the VMA.
> >
> >     This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
> >     and fixup_user_fault together help supporting it.  The patch also
> >     supports VM_MIXEDMAP vmas where the pfns are not reserved and thus
> >     subject to reference counting.
> >
> >     Cc: Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>
> >     Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> >     Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> >     Tested-by: Neo Jia <cjia@xxxxxxxxxx>
> >     Reported-by: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> >     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>
> This patch is known to be broken in so many ways. It also has a major
> security hole in that it ignores the PTE flags making the page RO.
> Ignoring the special bit is somehow not surprising :(
>
> This probably doesn't work, but is the general idea of what KVM needs
> to do:

Oh dear, when I dug around in there I entirely missed that
kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs
to grow a proper mmu notifier.

Another thing I'm wondering right now: the follow_pte();
fixup_user_fault(); follow_pte(); approach does not make any guarantees
of actually being right. If you're sufficiently unlucky you might race
against an immediate pte invalidate between the fixup and the 2nd
follow_pte(). But you can also not loop, because that would fail to
catch permanent faults.

I think the iommu fault drivers have a similar pattern. What am I
missing here? Or is that also just broken? gup works around this with
the slow path that takes the mmap sem and walks the vma tree;
follow_pte/fixup_user_fault users don't.

Maybe mmu notifier based restarting would help with this too, if done
properly.
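
To spell out the pattern I mean, a rough sketch (function name made up,
the mmap lock handling of the real code hand-waved away, so purely an
illustration and not a patch):

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Sketch of the lookup/fixup/lookup dance; the caller is assumed to
 * hold the mmap read lock for the whole thing.
 */
static int sketch_lookup_pfn(struct vm_area_struct *vma, unsigned long addr,
                             bool write_fault, unsigned long *pfn)
{
        pte_t *ptep;
        spinlock_t *ptl;
        int r;

        r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
        if (r) {
                /* no usable pte, fault it in and retry the lookup once */
                r = fixup_user_fault(vma->vm_mm, addr,
                                     write_fault ? FAULT_FLAG_WRITE : 0,
                                     NULL);
                if (r)
                        return r;

                /*
                 * Race window: a concurrent invalidate can zap the pte
                 * right here, so this second follow_pte() can fail even
                 * though the fixup above succeeded. Looping would paper
                 * over permanent faults, hence my question above.
                 */
                r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
                if (r)
                        return r;
        }

        *pfn = pte_pfn(*ptep);
        pte_unmap_unlock(ptep, ptl);
        return 0;
}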
-Daniel

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1376a47fedeedb..4161241fc3228c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                  return r;
>          }
>
> +        /*
> +         * Special PTEs are never convertible into a struct page, even if the
> +         * driver that owns them might have put a PFN with a struct page into
> +         * the PFNMAP. If the arch doesn't support special then we cannot
> +         * safely process these pages.
> +         */
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +        if (pte_special(*ptep))
> +                return -EINVAL;
> +#else
> +        return -EINVAL;
> +#endif
> +
>          if (write_fault && !pte_write(*ptep)) {
>                  pfn = KVM_PFN_ERR_RO_FAULT;
>                  goto out;
>
> Jason

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch