> On 20.11.2020 at 21:28, Pavel Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> Recently, I encountered a hang that is happening during a memory
> hot-remove operation. It turns out that the hang is caused by pinned
> user pages in ZONE_MOVABLE.
>
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user application, for example through dpdk
> libraries, pinned them via vfio dma map. Kernel keeps trying to
> hot-remove them, but the refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
>
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands of network flows handled by
> dpdk that we just cannot suspend for the duration of the hot-remove
> operation.

Hi!

It's a known problem also for VMs using vfio. I thought about this a
while ago and came to the same conclusion: before performing long-term
pinnings, we have to migrate pages off the movable zone. After that,
it's too late.

What happens when we can't migrate (OOM on !MOVABLE memory, short-term
pinning)? TBD.

> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that allows applications to select what
> zone the memory should come from.
>
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous pages, and
> anonymous transparent huge pages, to come from non-movable zones on
> demand.
>
> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to the
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
>
> Sample function stacks through which this info needs to be passed are:
>
> pin_user_pages_remote(gup_flags)
> __get_user_pages_remote(gup_flags)
> __gup_longterm_locked(gup_flags)
> __get_user_pages_locked(gup_flags)
> __get_user_pages(gup_flags)
> faultin_page(gup_flags)
> Convert gup_flags into fault_flags
> handle_mm_fault(fault_flags)
>
> From handle_mm_fault(), the stack diverges into various faults;
> examples include:
>
> Transparent Huge Page
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify the correct gfp_mask
> create_huge_pmd(vmf);
> do_huge_pmd_anonymous_page(vmf);
> mm_get_huge_zero_page(vma->vm_mm); -> the flag is lost here, so the
> flag from vmf.gfp_mask should be passed as well.
>
> There are several other similar paths for transparent huge pages, and
> there is also a named path where allocation is done by filesystems;
> the flag should be honored there as well, but it does not have to be
> added at the same time.
>
> Regular Pages
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify the correct gfp_mask
> handle_pte_fault(vmf)
> do_anonymous_page(vmf);
> page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> replace this call according to gfp_mask.
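
Just to make sure we mean the same thing for 1.: the plumbing I would
imagine looks roughly like the snippets below. FOLL_NOMOVABLE and
FAULT_FLAG_NOMOVABLE are made-up names for the new flag, and this is
only a sketch, not a tested patch:

/* In faultin_page(): translate the new gup flag into a fault flag. */
if (*flags & FOLL_NOMOVABLE)
	fault_flags |= FAULT_FLAG_NOMOVABLE;

/* In __handle_mm_fault(): let the fault flag select the gfp_mask. */
struct vm_fault vmf = {
	.vma = vma,
	.address = address & PAGE_MASK,
	.flags = flags,
	.gfp_mask = (flags & FAULT_FLAG_NOMOVABLE) ?
			GFP_HIGHUSER : GFP_HIGHUSER_MOVABLE,
};

/*
 * In do_anonymous_page(): honor vmf->gfp_mask instead of hard-coding
 * __GFP_MOVABLE via alloc_zeroed_user_highpage_movable().
 */
page = alloc_page_vma(vmf->gfp_mask | __GFP_ZERO, vma, vmf->address);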
> The above only takes care of the case where the user application
> faults on the page at pinning time, but there are also cases where
> pages already exist.
>
> 2. Add an internal move_pages_zone() similar to the move_pages()
> syscall, but instead of migrating to a different NUMA node, migrate
> pages from ZONE_MOVABLE to another zone.
> Call move_pages_zone() on demand prior to pinning pages, from
> vfio_pin_map_dma() for instance.
>
> 3. Perhaps it also makes sense to add an madvise() flag to allocate
> pages from a non-movable zone. When a user application knows that it
> will do DMA mapping, and pin pages for a long time, the memory that it
> allocates should never be migrated or hot-removed, so make sure that
> it comes from the appropriate place.
> The benefit of adding an madvise() flag is that we won't have to deal
> with slow page migration at pin time, but the disadvantage is that we
> would need to change the user interface.

Hm, I am not sure we want to expose these details. What would be the
semantics? "Might pin"? Hm, not sure.

Assume you start a fresh VM via QEMU with vfio. When we start mapping
guest memory via vfio, that's usually the time memory will get
populated. Not really much has to be migrated. I think this is even
true during live migration.

I think selective DMA pinning (e.g., vIOMMU in QEMU) is different,
where we keep pinning/unpinning on demand. But I guess even here, we
will often reuse some pages over and over again.

> Before I start working on the above approaches, I would like to get an
> opinion from the community on an appropriate path forward for this
> problem, whether what I described sounds reasonable, or whether there
> are other ideas on how to address the problem that I am seeing.

At least 1 and 2 sound sane. 3 is TBD - but it's a pure optimization,
so it can wait.

Thanks!

> Thank you,
> Pasha
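
P.S.: To make 2 a bit more concrete, below is roughly the callsite I
would imagine at the top of vfio_pin_map_dma(), before any pages are
pinned. move_pages_zone() does not exist yet, so its name and signature
here are pure placeholders; again only a sketch, not a tested patch:

	/*
	 * Placeholder: move_pages_zone() is the helper proposed in 2,
	 * its name and signature are made up here. Migrate the
	 * to-be-pinned range off ZONE_MOVABLE before the longterm pin
	 * is taken, and fail the DMA map if that migration fails.
	 */
	ret = move_pages_zone(current->mm, dma->vaddr, map_size, ZONE_NORMAL);
	if (ret)
		return ret;

	/* ... existing pinning and iommu mapping code continues here ... */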