On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote:
> Hi David,
>
> > > The first patch ensures that the mappings needed for handling mmap
> > > operation would be managed by using the pfn instead of struct page.
> > > The second patch restores support for mapping hugetlb pages where
> > > subpages of a hugepage are not directly used anymore (main reason
> > > for revert) and instead the hugetlb pages and the relevant offsets
> > > are used to populate the scatterlist for dma-buf export and for
> > > mmap operation.
> > >
> > > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
> > > were passed to the Host kernel and Qemu was launched with these
> > > relevant options: qemu-system-x86_64 -m 4096m....
> > > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > > -display gtk,gl=on
> > > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > > -machine memory-backend=mem1
> > >
> > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > exercise the mmap handler.
> >
> > While I think the VM_PFNMAP approach is much better and should fix that
> > issue at hand, I thought more about missing memlock support and realized
> > that we might have to fix something else. So I'm going to raise the
> > issue here.
> >
> > I think udmabuf chose the wrong interface to do what it's doing, which
> > makes it harder to fix it eventually.
> >
> > Instead of accepting a range in a memfd, it should just have accepted a
> > user space address range and then used
> > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> > "officially".
> Udmabuf indeed started off by using a user space address range and GUP, but
> the dma-buf subsystem maintainer had concerns with that approach in v2.
> It also had support for mlock in that version. Here is v2 and the relevant
> conversation:
> https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
>
> >
> > So what's the issue? Udmabuf effectively pins pages longterm ("possibly
> > forever") simply by grabbing a reference on them. These pages might
> > easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> >
> > So what udmabuf does is break memory hotunplug and CMA, because it turns
> > pages that have to remain movable unmovable.
> >
> > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate these
> > pages. See mm/gup.c:check_and_migrate_movable_pages() and especially
> > folio_is_longterm_pinnable(). We'd probably have to implement something
> > similar for udmabuf, where we detect such unpinnable pages and migrate
> > them.
> The pages udmabuf pins are only those associated with Guest (GPU
> driver/virtio-gpu) resources (or buffers allocated and pinned from shmem
> via drm GEM). Some resources are short-lived, and some are long-lived, and
> whenever a resource gets destroyed, the pages are unpinned. And not all
> resources have their pages pinned. The resource that is pinned for the
> longest duration is the FB, and that's because it is updated every ~16 ms
> (assuming 1920x1080@60) by the Guest GPU driver. We can certainly pin/unpin
> the FB after it is accessed on the Host as a workaround, but I guess that
> may not be very efficient given the amount of churn it would create.
>
> Also, as far as migration or S3/S4 is concerned, my understanding is that
> all the Guest resources are destroyed and recreated again. So, wouldn't
> something similar happen during memory hotunplug?
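
For reference, the longterm-pin flow mentioned above, where udmabuf would
take a user space address range and pin it with
pin_user_pages(FOLL_WRITE|FOLL_LONGTERM), would look roughly like the
kernel-side sketch below. This is only an illustration of that approach;
the function and variable names are made up and this is not the current
udmabuf code:

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch only: longterm-pin a user space range the "official" way. */
static struct page **pin_user_range_longterm(unsigned long uaddr,
					      long nr_pages)
{
	struct page **pages;
	long pinned;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return ERR_PTR(-ENOMEM);

	/*
	 * FOLL_LONGTERM tells GUP to migrate pages out of ZONE_MOVABLE and
	 * MIGRATE_CMA pageblocks before pinning, so the longterm pin does
	 * not block memory hotunplug or CMA allocations.
	 */
	pinned = pin_user_pages_fast(uaddr, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned != nr_pages) {
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		kvfree(pages);
		return ERR_PTR(pinned < 0 ? pinned : -EFAULT);
	}
	return pages;
}
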
>
> >
> > For example, pairing udmabuf with vfio (which pins pages using
> > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> > all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > pin_user_pages(). As long as udmabuf holds a reference on these pages,
> > that will never succeed.
> Dma-buf rules (for exporters) indicate that the pages only need to be
> pinned during the map_attachment phase (and until unmap_attachment
> happens). In other words, only when the sg_table is created by udmabuf.
> I guess one option would be to not hold any references during
> UDMABUF_CREATE and only grab references to the pages (as and when they get
> used) during this step. Would this help?

IIUC the refcount is needed; otherwise I don't see what protects the page
from being freed, and even reused elsewhere, before map_attachment().

It seems the previous concern with using GUP was mainly fork(), if this is it:

https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213

Could it also be guarded by just making sure the pages are mapped MAP_SHARED
when creating the udmabuf, if fork() is a requirement of the feature?

I have a feeling that userspace still always needs to do the right thing to
make it work, even with pure PFN mappings.

For instance, what if the user app just punches a hole in the shmem/hugetlbfs
file after creating the udmabuf (I see that F_SEAL_SHRINK is required, but at
least not F_SEAL_WRITE with the current implementation), and then faults a
new page into the page cache?

Thanks,

>
> >
> > There are *probably* more issues on the QEMU side when udmabuf is paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> > the vfio/vdpa case we make sure that we disallow most of these, because
> > otherwise there can be an accidental "disconnect" between the pages
> > mapped into the VM (guest view) and the pages mapped into the IOMMU
> > (device view), for example, after a reboot.
> Ok; I am not sure if I can figure out whether there is any acceptable way
> to address these issues, but given the current constraints associated with
> udmabuf, what do you suggest is the most reasonable way to deal with the
> problems you have identified?
>
> Thanks,
> Vivek
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
>

--
Peter Xu
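
To make the hole-punch scenario above concrete, below is a minimal userspace
sketch (error handling elided; the 2 MiB size is arbitrary and the sequence is
purely illustrative). Because udmabuf only requires F_SEAL_SHRINK and not
F_SEAL_WRITE, the FALLOC_FL_PUNCH_HOLE still succeeds after UDMABUF_CREATE,
and a later write would fault fresh pages into the page cache while the
exported dma-buf may still reference the old ones:

/*
 * Sketch of the scenario above: create a udmabuf from a memfd sealed only
 * with F_SEAL_SHRINK, then punch a hole in the memfd. Nothing rejects the
 * hole punch, so the pages behind the dma-buf and the pages later faulted
 * back into the page cache can diverge.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
	const off_t size = 2 * 1024 * 1024;
	struct udmabuf_create create;
	int memfd, devfd, dmabuf_fd;

	memfd = memfd_create("guest-ram", MFD_ALLOW_SEALING);
	ftruncate(memfd, size);
	/* udmabuf insists on F_SEAL_SHRINK, but not on F_SEAL_WRITE. */
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	devfd = open("/dev/udmabuf", O_RDWR);
	memset(&create, 0, sizeof(create));
	create.memfd  = memfd;
	create.offset = 0;
	create.size   = size;
	dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

	/* Still succeeds: the pages under the dma-buf are dropped from the file. */
	fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size);

	/*
	 * A subsequent write to the memfd faults in fresh page-cache pages,
	 * which are no longer the pages the dma-buf exported.
	 */

	close(dmabuf_fd);
	close(devfd);
	close(memfd);
	return 0;
}
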