On Wed, 23 Nov 2022 at 16:15, Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 23.11.22 um 16:08 schrieb Jason Gunthorpe:
> > On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
> >>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>> index 1376a47fedeedb..4161241fc3228c 100644
> >>> --- a/virt/kvm/kvm_main.c
> >>> +++ b/virt/kvm/kvm_main.c
> >>> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >>>                 return r;
> >>>         }
> >>>
> >>> +       /*
> >>> +        * Special PTEs are never convertible into a struct page, even if the
> >>> +        * driver that owns them might have put a PFN with a struct page into
> >>> +        * the PFNMAP. If the arch doesn't support special then we cannot
> >>> +        * safely process these pages.
> >>> +        */
> >>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >>> +       if (pte_special(*ptep))
> >>> +               return -EINVAL;
> >> On second thought this wont work, because it completely defeats the
> >> point of why this code here exists. remap_pfn_range() (which is what
> >> the various dma_mmap functions and the ioremap functions are built on
> >> top of too) sets VM_PFNMAP too, so this check would even catch the
> >> static mappings.
> > The problem with the way this code is designed is how it allows
> > returning the pfn without taking any reference based on things like
> > !pfn_valid or page_reserved. This allows it to then conditionally put
> > back the reference based on the same reasoning. It is impossible to
> > thread pte special into that since it is a PTE flag, not a property of
> > the PFN.
> >
> > I don't entirely understand why it needs the page reference at all,
>
> That's exactly what I've pointed out in the previous discussion about
> that code as well.
>
> As far as I can see it this is just another case where people assumed
> that grabbing a page reference somehow magically prevents the pte from
> changing.
>
> I have not the slightest idea how people got this impression, but I have
> heard it so many time from so many different sources that there must be
> some common cause to this. Is the maybe some book or tutorial how to
> sophisticate break the kernel or something like this?

It's what get_user_pages does, so it does "work". Except this path here
is the fallback for when get_user_pages does not work (because of the
pte_special/VM_SPECIAL case). So essentially it's just a rather broken
get_user_pages that handrolls a bunch of things with bugs & races.

I have no idea why people don't realize they're just reinventing gup
without using gup, but that's essentially what's going on.

> Anyway as far as I can see only correct approach would be to use an MMU
> notifier or more high level hmm_range_fault()+seq number.

Yeah, plus if you go through ptes you really have to obey all the flags
or things will break. Especially the RO pte flag.
-Daniel

>
> Regards,
> Christian.
>
> > even if it is available - so I can't guess why it is OK to ignore the
> > page reference in other cases, or why it is OK to be racy..
> >
> > Eg hmm_range_fault() does not obtain page references and implements a
> > very similar algorithm to kvm.
> >
> >> Plus these static mappings aren't all that static either, e.g. pci
> >> access also can revoke bar mappings nowadays.
> > And there are already mmu notifiers to handle that, AFAIK.
> >
> > Jason
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
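
For context, the notifier + sequence-number pattern referred to above looks
roughly like the minimal sketch below. It is built on the
mmu_interval_notifier and hmm_range_fault() APIs; the struct, the lock and
the error handling are illustrative placeholders, not the actual kvm code.

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

struct my_range_state {
        struct mm_struct *mm;
        /* registered beforehand with mmu_interval_notifier_insert() */
        struct mmu_interval_notifier notifier;
        /* protects whatever consumes the pfns */
        struct mutex lock;
};

static bool my_range_invalidate(struct mmu_interval_notifier *mni,
                                const struct mmu_notifier_range *range,
                                unsigned long cur_seq)
{
        struct my_range_state *state =
                container_of(mni, struct my_range_state, notifier);

        if (!mmu_notifier_range_blockable(range))
                return false;

        /* Bump the sequence count under the same lock the consumer holds. */
        mutex_lock(&state->lock);
        mmu_interval_set_seq(mni, cur_seq);
        mutex_unlock(&state->lock);
        return true;
}

static const struct mmu_interval_notifier_ops my_range_ops = {
        .invalidate = my_range_invalidate,
};

static int my_range_fault(struct my_range_state *state, unsigned long start,
                          unsigned long end, unsigned long *pfns)
{
        struct hmm_range range = {
                .notifier       = &state->notifier,
                .start          = start,
                .end            = end,
                .hmm_pfns       = pfns,
                /* add HMM_PFN_REQ_WRITE if write access is needed */
                .default_flags  = HMM_PFN_REQ_FAULT,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(&state->notifier);

        mmap_read_lock(state->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(state->mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                return ret;
        }

        mutex_lock(&state->lock);
        if (mmu_interval_read_retry(&state->notifier, range.notifier_seq)) {
                /* An invalidation raced with us, the pfns are stale. */
                mutex_unlock(&state->lock);
                goto again;
        }

        /*
         * pfns[] is only valid while state->lock is held and the sequence
         * count is unchanged; use it here, e.g. to update secondary mappings.
         */
        mutex_unlock(&state->lock);
        return 0;
}

The point of the pattern is that the pfns are only trusted while the driver
lock is held and the sequence count has not moved; any concurrent
invalidation forces a retry, so no page reference is needed to keep the pte
from changing underneath you.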
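
And to spell out the RO pte point: the minimum a handrolled pte walker has
to do is refuse to hand out the pfn for writing when the pte itself is not
writable. Roughly, as an illustrative fragment in the style of the quoted
hva_to_pfn_remapped() hunk (not necessarily what the kvm code does today):

        /* Illustrative only: never upgrade a read-only pte to a writable pfn. */
        if (write_fault && !pte_write(*ptep))
                return -EFAULT;
        if (writable)
                *writable = pte_write(*ptep);

Obeying the write permission is exactly the sort of thing gup and
hmm_range_fault() already do for you, which is the whole argument for not
handrolling this.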