On Wed, Sep 07, 2022 at 11:00:21AM +0200, David Hildenbrand wrote:

> > > I do wonder if that's a real issue, though. One approach would be to
> > > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
> > > for a while. Of course, that only works if we assume that such pinned
> > > zeropages are only extremely rarely longterm-pinned for a single VM
> > > instance by VFIO.
> >
> > I'm confused, doesn't vfio increment the memlock for every page of VA
> > it pins? Why would it matter if the page was COW'd or not? It is
> > already accounted for today as though it was a unique page.
> >
> > IOW if we add FOLL_FORCE it won't change the value of the memlock.
>
> I only briefly skimmed over the code Alex might be able to provide more
> details and correct me if I'm wrong:
>
> vfio_pin_pages_remote() contains a comment:
>
> "Reserved pages aren't counted against the user, externally pinned pages are
> already counted against the user."
>
> is_invalid_reserved_pfn() should return "true" for the shared zeropage and
> prevent us from accounting it via vfio_lock_acct(). Otherwise,
> vfio_find_vpfn() seems to be in place to avoid double-accounting pages.

is_invalid_reserved_pfn() is supposed to return 'true' for PFNs that
cannot be returned from pin_user_pages():

/*
 * Some mappings aren't backed by a struct page, for example an mmap'd
 * MMIO range for our own or another device.  These use a different
 * pfn conversion and shouldn't be tracked as locked pages.
 * For compound pages, any driver that sets the reserved bit in head
 * page needs to set the reserved bit in all subpages to be safe.
 */
static bool is_invalid_reserved_pfn(unsigned long pfn)

What it is talking about by 'different pfn conversion' is the
follow_fault_pfn() path, not the PUP path.

So, it is some way for VFIO to keep track of when a pfn was returned
by PUP vs follow_fault_pfn(), because it treats those two paths quite
differently.

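As a rough illustration of the classification and accounting discussed
above, here is a small userspace model. It is a sketch, not the kernel
code: struct model_pfn and both model_* functions are hypothetical names
invented for this example, and the "valid" and "reserved" flags merely
stand in for whether a pfn has a struct page and whether that page has
its reserved bit set. Treating the shared zero page as a valid, reserved
pfn is an assumption taken from the claim quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-pfn state; not a real kernel type. */
struct model_pfn {
	bool valid;    /* models: the pfn is backed by a struct page */
	bool reserved; /* models: that struct page has the reserved bit set */
};

/*
 * Models the check under discussion: returns true when the pfn should be
 * treated as "not returned by pin_user_pages()" and so not tracked as a
 * locked page.
 */
static bool model_is_invalid_reserved_pfn(const struct model_pfn *p)
{
	if (p->valid)
		return p->reserved;

	/* No struct page at all, e.g. an mmap'd MMIO range. */
	return true;
}

/* Models the accounting: count only pfns that pass the check above. */
static size_t model_count_locked(const struct model_pfn *pfns, size_t n)
{
	size_t locked = 0;

	for (size_t i = 0; i < n; i++)
		if (!model_is_invalid_reserved_pfn(&pfns[i]))
			locked++;

	return locked;
}
```

In this model an ordinary page is counted, while an MMIO pfn and a
reserved page (which is how the zero page is assumed to look here) are
both skipped.
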
I lost track of what the original cause of this bug is - however AFAIK
pin_user_pages() used to succeed when the zero page is mapped.

No other PUP user calls this follow_fault_pfn() hacky path, and we
expect things like O_DIRECT to work properly even when reading from VA
that has the zero page mapped.

So, if we go back far enough in the git history we will find a case
where PUP is returning something for the zero page, and that something
caused is_invalid_reserved_pfn() == false since VFIO did work at some
point.

IMHO we should simply go back to the historical behavior - make
is_invalid_reserved_pfn() check for the zero_pfn and return
false. Meaning that PUP returned it.

Jason