On 02.09.22 10:24, Tian, Kevin wrote: > Hi, Alex, > >> From: Alex Williamson <alex.williamson@xxxxxxxxxx> >> Sent: Tuesday, August 30, 2022 11:06 AM >> >> There's currently a reference count leak on the zero page. We increment >> the reference via pin_user_pages_remote(), but the page is later handled >> as an invalid/reserved page, therefore it's not accounted against the >> user and not unpinned by our put_pfn(). >> >> Introducing special zero page handling in put_pfn() would resolve the >> leak, but without accounting of the zero page, a single user could >> still create enough mappings to generate a reference count overflow. >> >> The zero page is always resident, so for our purposes there's no reason >> to keep it pinned. Therefore, add a loop to walk pages returned from >> pin_user_pages_remote() and unpin any zero pages. >> > > We found an interesting issue on zero page and wonder whether we > should instead find a way to not use zero page in vfio pinning path. > > The observation - the 'pc.bios' region (0xfffc0000) is always mapped > RO to zero page in the IOMMU page table even after the mapping in > the CPU page table has been changed after Qemu loads the guest > bios image into that region (which is mmap'ed as RW). > > In reality this may not cause real problem as I don't expect any sane > usage would want to DMA read from the bios region. This is probably > the reason why nobody ever notes it. > > But in concept it is incorrect. > > Fixing Qemu to update/setup the VFIO mapping after loading the bios > image could mitigate this problem. But we never document such ABI > restriction on RO mappings and in concept the pinning semantics > should apply to all paths (RO in DMA and RW in CPU) which the > application uses to access the pinned memory hence the sequence > shouldn't matter from user p.o.v > > And old Qemu/VMM still have this issue. > > Having a notifier to implicitly fix the IOMMU mapping within the > kernel violates the semantics of pinning, and makes vfio page > accounting even more tricky. > > So I wonder instead of continuing to fix trickiness around the zero > page whether it is a better idea to pursue allocating a normal > page from the beginning for pinned RO mappings? That's precisely what I am working. For example, that's required to get rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA: See https://lore.kernel.org/all/5593cbb7-eb29-82f0-490e-dd72ceafff9b@xxxxxxxxxx/ for some more details. The concerns I discussed with Peter and Alex offline for VFIO is that we might end up charging more anon pages with that change to the MEMLOCK limit of the user and might degrade existing setups. I do wonder if that's a real issue, though. One approach would be to warn the VFIO users and allow for slightly exceeding the MEMLOCK limit for a while. Of course, that only works if we assume that such pinned zeropages are only extremely rarely longterm-pinned for a single VM instance by VFIO. -- Thanks, David / dhildenb