On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote:
> On 15.08.23 04:34, John Hubbard wrote:
> > On 8/14/23 02:09, Yan Zhao wrote:
> > ...
> > > > hmm_range_fault()-based memory management in particular might benefit
> > > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > > region, come to think of it. That seems relatively easy and clean at
> > > > first glance anyway.
> > > >
> > > > For other regions (allocated by the device driver), a per-VMA flag
> > > > seems about right: VM_NO_NUMA_BALANCING ?
> > > >
> > > Thanks a lot for those good suggestions!
> > > For VMs, when could a per-VMA flag be set?
> > > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> > > after it's mapped into VFIO.
> > > Then, should VFIO set this flag on after it maps a range?
> > > Could this flag be unset after device hot-unplug?
> > >
> >
> > I'm hoping someone who thinks about VMs and VFIO often can chime in.
>
> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao,
> using madvise).
>
> BUT, I do wonder what value there would be for autonuma to still be active
Currently the MADV_* values only go up to 25:
	#define MADV_COLLAPSE	25
and the madvise() behavior argument is of type "int", so there is room for a
new value there.

But vma->vm_flags is an "unsigned long", so there are no free bits left, at
least on 32-bit platforms.

> for the remainder of the hypervisor. If there is none, a prctl() would be
> better.
How about adding a new field to "struct vma_numab_state" in the vma, and using
prctl() to update that field?

e.g.
struct vma_numab_state {
	unsigned long next_scan;
	unsigned long next_pid_reset;
	unsigned long access_pids[2];
	bool no_scan;
};

>
> We already do have a mechanism in QEMU to get notified when longterm-pinning
> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used):
> * ram_block_discard_disable()
> * ram_block_uncoordinated_discard_disable()
It looks like this ram_block_discard allow/disallow state is global rather than
per-VMA in QEMU.
So, do you mean that the kernel should provide a per-VMA allow/disallow
mechanism, and that it's up to user space to choose between the per-VMA but
more complex way and the global but simpler way?
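
To make the per-VMA option a bit more concrete, here is a rough userspace
model of how a "no_scan" field could be set and cleared over an address range
and then consulted by the scanner. It is only a sketch under assumptions:
"struct model_vma", model_set_no_scan() and model_vma_should_scan() are
hypothetical names made up for illustration, none of this is existing kernel
code, and how prctl() or madvise() would reach such a helper is left open.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; mirrors the struct layout proposed in this mail. */
struct vma_numab_state {
	unsigned long next_scan;
	unsigned long next_pid_reset;
	unsigned long access_pids[2];
	bool no_scan;	/* proposed: skip this VMA during NUMA balancing scans */
};

/* Hypothetical stand-in for a VMA covering [vm_start, vm_end). */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	struct vma_numab_state numab_state;
};

/*
 * Hypothetical helper: set or clear no_scan on every VMA that overlaps
 * [start, end). Returns the number of VMAs touched. Setting would happen
 * after the range is mapped for DMA, clearing after device hot-unplug.
 */
static int model_set_no_scan(struct model_vma *vmas, size_t n,
			     unsigned long start, unsigned long end,
			     bool no_scan)
{
	int changed = 0;

	for (size_t i = 0; i < n; i++) {
		if (vmas[i].vm_start < end && start < vmas[i].vm_end) {
			vmas[i].numab_state.no_scan = no_scan;
			changed++;
		}
	}
	return changed;
}

/* Hypothetical check the scanner would make before touching a VMA. */
static bool model_vma_should_scan(const struct model_vma *vma)
{
	return !vma->numab_state.no_scan;
}
```

In this model the global alternative would collapse to a single per-process
boolean checked once before walking any VMA, which is simpler but cannot
leave NUMA balancing active for the rest of the hypervisor's memory.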