On Wed, 23 Nov 2022 at 16:15, Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 23.11.22 um 16:08 schrieb Jason Gunthorpe:
> > On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
> >>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>> index 1376a47fedeedb..4161241fc3228c 100644
> >>> --- a/virt/kvm/kvm_main.c
> >>> +++ b/virt/kvm/kvm_main.c
> >>> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >>>                 return r;
> >>>         }
> >>>
> >>> +       /*
> >>> +        * Special PTEs are never convertible into a struct page, even if the
> >>> +        * driver that owns them might have put a PFN with a struct page into
> >>> +        * the PFNMAP. If the arch doesn't support special then we cannot
> >>> +        * safely process these pages.
> >>> +        */
> >>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >>> +       if (pte_special(*ptep))
> >>> +               return -EINVAL;
> >> On second thought this wont work, because it completely defeats the
> >> point of why this code here exists. remap_pfn_range() (which is what
> >> the various dma_mmap functions and the ioremap functions are built on
> >> top of too) sets VM_PFNMAP too, so this check would even catch the
> >> static mappings.
> > The problem with the way this code is designed is how it allows
> > returning the pfn without taking any reference based on things like
> > !pfn_valid or page_reserved. This allows it to then conditionally put
> > back the reference based on the same reasoning. It is impossible to
> > thread pte special into that since it is a PTE flag, not a property of
> > the PFN.
> >
> > I don't entirely understand why it needs the page reference at all,
>
> That's exactly what I've pointed out in the previous discussion about
> that code as well.
>
> As far as I can see it this is just another case where people assumed
> that grabbing a page reference somehow magically prevents the pte from
> changing.
>
> I have not the slightest idea how people got this impression, but I have
> heard it so many time from so many different sources that there must be
> some common cause to this. Is the maybe some book or tutorial how to
> sophisticate break the kernel or something like this?

It's what get_user_pages does, so it does "work". Except this path here
is the fallback for when get_user_pages does not work (because of the
pte_special/VM_SPECIAL case). So essentially it's just a rather broken
get_user_pages that handrolls a bunch of things with bugs & races.

I have no idea why people don't realize they're just reinventing gup
without using gup, but that's essentially what's going on.

> Anyway as far as I can see only correct approach would be to use an MMU
> notifier or more high level hmm_range_fault()+seq number.

Yeah, plus if you go through ptes you really have to obey all the flags
or things will break. Especially the RO pte flag.
-Daniel

>
> Regards,
> Christian.
>
> > even if it is available - so I can't guess why it is OK to ignore the
> > page reference in other cases, or why it is OK to be racy..
> >
> > Eg hmm_range_fault() does not obtain page references and implements a
> > very similar algorithm to kvm.
> >
> >> Plus these static mappings aren't all that static either, e.g. pci
> >> access also can revoke bar mappings nowadays.
> > And there are already mmu notifiers to handle that, AFAIK.
> >
> > Jason
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
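
For context, the notifier + sequence-number pattern referred to above looks
roughly like the minimal sketch below. It is built on the
mmu_interval_notifier and hmm_range_fault() APIs; the struct, the lock and
the error handling are illustrative placeholders, not the actual kvm code.

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

struct my_range_state {
        struct mm_struct *mm;
        /* registered beforehand with mmu_interval_notifier_insert() */
        struct mmu_interval_notifier notifier;
        /* protects whatever consumes the pfns */
        struct mutex lock;
};

static bool my_range_invalidate(struct mmu_interval_notifier *mni,
                                const struct mmu_notifier_range *range,
                                unsigned long cur_seq)
{
        struct my_range_state *state =
                container_of(mni, struct my_range_state, notifier);

        if (!mmu_notifier_range_blockable(range))
                return false;

        /* Bump the sequence count under the same lock the consumer holds. */
        mutex_lock(&state->lock);
        mmu_interval_set_seq(mni, cur_seq);
        mutex_unlock(&state->lock);
        return true;
}

static const struct mmu_interval_notifier_ops my_range_ops = {
        .invalidate = my_range_invalidate,
};

static int my_range_fault(struct my_range_state *state, unsigned long start,
                          unsigned long end, unsigned long *pfns)
{
        struct hmm_range range = {
                .notifier       = &state->notifier,
                .start          = start,
                .end            = end,
                .hmm_pfns       = pfns,
                /* add HMM_PFN_REQ_WRITE if write access is needed */
                .default_flags  = HMM_PFN_REQ_FAULT,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(&state->notifier);

        mmap_read_lock(state->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(state->mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                return ret;
        }

        mutex_lock(&state->lock);
        if (mmu_interval_read_retry(&state->notifier, range.notifier_seq)) {
                /* An invalidation raced with us, the pfns are stale. */
                mutex_unlock(&state->lock);
                goto again;
        }

        /*
         * pfns[] is only valid while state->lock is held and the sequence
         * count is unchanged; use it here, e.g. to update secondary mappings.
         */
        mutex_unlock(&state->lock);
        return 0;
}

The point of the pattern is that the pfns are only trusted while the driver
lock is held and the sequence count has not moved; any concurrent
invalidation forces a retry, so no page reference is needed to keep the pte
from changing underneath you.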
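
And to spell out the RO pte point: the minimum a handrolled pte walker has
to do is refuse to hand out the pfn for writing when the pte itself is not
writable. Roughly, as an illustrative fragment in the style of the quoted
hva_to_pfn_remapped() hunk (not necessarily what the kvm code does today):

        /* Illustrative only: never upgrade a read-only pte to a writable pfn. */
        if (write_fault && !pte_write(*ptep))
                return -EFAULT;
        if (writable)
                *writable = pte_write(*ptep);

Obeying the write permission is exactly the sort of thing gup and
hmm_range_fault() already do for you, which is the whole argument for not
handrolling this.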