On 30.08.22 17:11, Alex Williamson wrote:
> On Tue, 30 Aug 2022 09:59:33 +0200
> David Hildenbrand <david@xxxxxxxxxx> wrote:
>
>> On 30.08.22 05:05, Alex Williamson wrote:
>>> There's currently a reference count leak on the zero page.  We increment
>>> the reference via pin_user_pages_remote(), but the page is later handled
>>> as an invalid/reserved page, therefore it's not accounted against the
>>> user and not unpinned by our put_pfn().
>>>
>>> Introducing special zero page handling in put_pfn() would resolve the
>>> leak, but without accounting of the zero page, a single user could
>>> still create enough mappings to generate a reference count overflow.
>>>
>>> The zero page is always resident, so for our purposes there's no reason
>>> to keep it pinned.  Therefore, add a loop to walk pages returned from
>>> pin_user_pages_remote() and unpin any zero pages.
>>>
>>> Cc: David Hildenbrand <david@xxxxxxxxxx>
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Reported-by: Luboslav Pivarc <lpivarc@xxxxxxxxxx>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 12 ++++++++++++
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index db516c90a977..8706482665d1 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>>>  	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
>>>  				    pages, NULL, NULL);
>>>  	if (ret > 0) {
>>> +		int i;
>>> +
>>> +		/*
>>> +		 * The zero page is always resident, we don't need to pin it
>>> +		 * and it falls into our invalid/reserved test so we don't
>>> +		 * unpin in put_pfn().  Unpin all zero pages in the batch here.
>>> +		 */
>>> +		for (i = 0 ; i < ret; i++) {
>>> +			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
>>> +				unpin_user_page(pages[i]);
>>> +		}
>>> +
>>>  		*pfn = page_to_pfn(pages[0]);
>>>  		goto done;
>>>  	}
>>>
>>>
>>
>> As discussed offline, for the shared zeropage (that's not even
>> refcounted when mapped into a process), this makes perfect sense to me.
>>
>> Good question raised by Sean if ZONE_DEVICE pages might similarly be
>> problematic. But for them, we cannot simply always unpin here.
>
> What sort of VM mapping would give me ZONE_DEVICE pages?  Thanks,

I think one approach is mmap'ing a devdax device. To test without actual
NVDIMM hardware, there are ways to simulate it even on bare metal using
the "memmap=" kernel parameter.

https://nvdimm.wiki.kernel.org/

Alternatively, you can use an emulated nvdimm device under QEMU -- but
then you'd have to run VFIO inside the VM. I know (that you know) that
there are ways to get that working, but it certainly requires more
effort :)

... let me know if you need any tips&tricks.

-- 
Thanks,

David / dhildenb