On 16.08.24 16:21, Peter Xu wrote:
On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
On 14.08.24 15:05, Jason Gunthorpe wrote:
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
That is in general not what we want, and we still have some places that
wrongly hard-code that behavior.
In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
vm_normal_page_pud()] should be able to identify PFN maps and reject them,
no?
Yep, I think we can also rely on special bit.
It is more than just relying on the special bit..
VM_PFNMAP/VM_MIXEDMAP should really only be used inside
vm_normal_page() because thay are, effectively, support for a limited
emulation of the special bit on arches that don't have them. There are
a bunch of weird rules that are used to try and make that work
properly that have to be followed.
On arches with the sepcial bit they should possibly never be checked
since the special bit does everything you need.
Arguably any place reading those flags out side of vm_normal_page/etc
is suspect.
IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
GUP-fast is special (one of the reason for "pte_special()" and friends after
all).
The issue is at least GUP currently doesn't work with pfnmaps, while
there're potentially users who wants to be able to work on both page +
!page use cases. Besides access_process_vm(), KVM also uses similar thing,
and maybe more; these all seem to be valid use case of reference the vma
flags for PFNMAP and such, so they can identify "it's pfnmap" or more
generic issues like "permission check error on pgtable".
What at least VFIO does is first try GUP, and if that fails, try
follow_fault_pfn()->follow_pte(). There is a VM_PFNMAP check in there, yes.
Ideally, follow_pte() would never return refcounted/normal pages, then
the PFNMAP check might only be a performance improvement (maybe).
The whole private mapping thing definitely made it complicated.
Yes, and follow_pte() for now could even return ordinary anon pages. I
spotted that when I was working on that VM_PAT stuff, but I was to
unsure what to do (see below that KVM with MAP_PRIVATE /dev/mem might
just work, no idea if there are use cases?).
Fortunately, vfio calls is_invalid_reserved_pfn() and refuses anything
that has a struct page.
I think KVM does something nasty: if it something with a "struct page",
and it's not PageReserved, it would take a reference (if I get
kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not
normal" page -- it essentially ignores the vm_normal_page() information
in the page tables ...
So anon pages in pivate mappings from follow_pte() might currently work
with KVM ... because of the way KVM uses follow_pte().
I did not play with it, so I'm not sure if I am missing some detail.
--
Cheers,
David / dhildenb