On Thu, Aug 17, 2023 at 09:38:37AM +0200, David Hildenbrand wrote:
> On 17.08.23 07:05, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> > > On 8/16/23 02:49, David Hildenbrand wrote:
> > > > But do 32bit architectures even care about NUMA hinting? If not, just
> > > > ignore them ...
> > >
> > > Probably not!
> > >
> > > ...
> > > > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > > > mechanism, and
> > > > > it's up to the user space to choose between per-VMA and complex way or
> > > > > global and simpler way?
> > > >
> > > > QEMU could do either way. The question would be if a per-vma settings
> > > > makes sense for NUMA hinting.
> > >
> > > From our experience with compute on GPUs, a per-mm setting would suffice.
> > > No need to go all the way to VMA granularity.
> > >
> > After an offline internal discussion, we think a per-mm setting is also
> > enough for device passthrough in VMs.
> >
> > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> > think it's of any value to providing a flag like VM_MAYDMA?
> > Auto NUMA balancing or other components can decide how to use it by
> > themselves.
>
> Short-lived DMA is not really the problem. The problem is long-term pinning.
>
> There was a discussion about letting user space similarly hint that
> long-term pinning might/will happen.
>
> Because when long-term pinning a page we have to make sure to migrate it off
> of ZONE_MOVABLE / MIGRATE_CMA.
>
> But the kernel prefers to place pages there.
>
> So with vfio in QEMU, we might preallocate memory for the guest and place it
> on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all
> these fresh pages out of these areas again.
>
> So letting the kernel know about that in this context might also help.
>
Thanks! Glad to know it :)

But consider the GPU case that John mentioned: since the memory there is
not even pinned, maybe they would still need a flag like
VM_NO_NUMA_BALANCING?
For VMs, we could hint VM_NO_NUMA_BALANCING for passthrough devices that
support IO page faults (so there is no need to pin), and VM_MAYLONGTERMDMA
to avoid misplacement and the subsequent migration.

Does that sound good? Or do you think a per-mm flag like MMF_NO_NUMA alone
is good enough for now?