On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > >
> > > > > > > Maybe a GFP flag to set the page reference count to zero or
> > > > > > > something like this?
> > > > > >
> > > > > > Hm yeah that might work. I'm not sure what it will all break though?
> > > > > > And we'd need to make sure that underflowing the page refcount dies
> > > > > > in a backtrace.
> > > > >
> > > > > Mucking with the refcount like this to protect against crazy out of
> > > > > tree drivers seems horrible..
> > > >
> > > > Well, not only out of tree drivers. The in-tree KVM got that horribly
> > > > wrong as well, those were the latest guys complaining about it.
> > >
> > > kvm was taking refs on special PTEs? That seems really unlikely?
> >
> > Well then look at this code here:
> >
> > commit add6a0cd1c5ba51b201e1361b05a5df817083618
> > Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Date:   Tue Jun 7 17:51:18 2016 +0200
> >
> >     KVM: MMU: try to fix up page faults before giving up
> >
> >     The vGPU folks would like to trap the first access to a BAR by setting
> >     vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault
> >     handler then can use remap_pfn_range to place some non-reserved pages
> >     in the VMA.
> >
> >     This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
> >     and fixup_user_fault together help supporting it.  The patch also
> >     supports VM_MIXEDMAP vmas where the pfns are not reserved and thus
> >     subject to reference counting.
> >
> >     Cc: Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>
> >     Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> >     Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> >     Tested-by: Neo Jia <cjia@xxxxxxxxxx>
> >     Reported-by: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> >     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>
> This patch is known to be broken in so many ways. It also has a major
> security hole in that it ignores the PTE flags making the page RO.
> Ignoring the special bit is somehow not surprising :(
>
> This probably doesn't work, but is the general idea of what KVM needs
> to do:

Oh dear, when I dug around in there I entirely missed that
kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs
to grow a proper mmu notifier.

Another thing I'm wondering right now: the follow_pte();
fixup_user_fault(); follow_pte(); approach does not make any guarantees
of actually being right. If you're sufficiently unlucky you might race
against an immediate pte invalidate between the fixup and the 2nd
follow_pte(). But you can also not loop, because that would fail to
catch permanent faults.

I think the iommu fault drivers have a similar pattern. What am I
missing here? Or is that also just broken? gup works around this with
the slow path that takes the mmap sem and walks the vma tree;
follow_pte/fixup_user_fault users don't.

Maybe mmu notifier based restarting would help with this too, if done
properly.
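
To spell out the pattern I mean, a rough sketch (function name made up,
the mmap lock handling of the real code hand-waved away, so purely an
illustration and not a patch):

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Sketch of the lookup/fixup/lookup dance; the caller is assumed to
 * hold the mmap read lock for the whole thing.
 */
static int sketch_lookup_pfn(struct vm_area_struct *vma, unsigned long addr,
                             bool write_fault, unsigned long *pfn)
{
        pte_t *ptep;
        spinlock_t *ptl;
        int r;

        r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
        if (r) {
                /* no usable pte, fault it in and retry the lookup once */
                r = fixup_user_fault(vma->vm_mm, addr,
                                     write_fault ? FAULT_FLAG_WRITE : 0,
                                     NULL);
                if (r)
                        return r;

                /*
                 * Race window: a concurrent invalidate can zap the pte
                 * right here, so this second follow_pte() can fail even
                 * though the fixup above succeeded. Looping would paper
                 * over permanent faults, hence my question above.
                 */
                r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
                if (r)
                        return r;
        }

        *pfn = pte_pfn(*ptep);
        pte_unmap_unlock(ptep, ptl);
        return 0;
}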
-Daniel

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1376a47fedeedb..4161241fc3228c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>                  return r;
>          }
>
> +        /*
> +         * Special PTEs are never convertible into a struct page, even if the
> +         * driver that owns them might have put a PFN with a struct page into
> +         * the PFNMAP. If the arch doesn't support special then we cannot
> +         * safely process these pages.
> +         */
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +        if (pte_special(*ptep))
> +                return -EINVAL;
> +#else
> +        return -EINVAL;
> +#endif
> +
>          if (write_fault && !pte_write(*ptep)) {
>                  pfn = KVM_PFN_ERR_RO_FAULT;
>                  goto out;
>
> Jason

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch