Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd

David Hildenbrand <david@xxxxxxxxxx> · Mon, 11 Nov 2024 23:14:19 +0100

On 11.11.24 12:02, Vlastimil Babka wrote:
On 11/8/24 18:31, Paolo Bonzini wrote:
On 11/7/24 16:10, Matthew Wilcox wrote:
On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
The folio allocation path from guest_memfd typically looks like this...

kvm_gmem_get_folio
    filemap_grab_folio
      __filemap_get_folio
        filemap_alloc_folio
          __folio_alloc_node_noprof
            -> goes to the buddy allocator

Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.

It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
real problem that you're trying to solve that cpusets are being used
incorrectly?

If it's false it's not very different, it goes to alloc_pages_noprof().
Then it respects the process's policy, but the policy is not
customizable without mucking with state that is global to the process.

Taking a step back: the problem is that a VM can be configured to have
multiple guest-side NUMA nodes, each of which will pick memory from the
right NUMA node in the host.  Without a per-file operation it's not
possible to do this on guest_memfd.  The discussion was whether to use
ioctl() or a new system call.  The discussion ended with the idea of
posting a *proposal* asking for *comments* as to whether the system call
would be useful in general beyond KVM.

Commenting on the system call itself I am not sure I like the
file_operations entry, though I understand that it's the simplest way to
implement this in an RFC series.  It's a bit surprising that fbind() is
a total no-op for everything except KVM's guest_memfd.

Maybe whatever you pass to fbind() could be stored in the struct file *,
and used as the default when creating VMAs; as if every mmap() was
followed by an mbind(), except that it also does the right thing with
MAP_POPULATE for example.  Or maybe that's a horrible idea?

mbind() manpage has this:

        The  specified  policy  will  be  ignored  for  any MAP_SHARED
mappings in the specified memory range.  Rather the pages will be allocated
according to the memory policy of the thread that caused the page to be
allocated. Again, this may not be the thread that called mbind().

I recall discussing that a couple of times in the context of QEMU. I 
have some faint recollection that the manpage is a bit imprecise:

IIRC, hugetlb also ends up using the VMA policy for MAP_SHARED mappings 
during faults (huge_node()->get_vma_policy()) -- but in contrast to 
shmem, it doesn't end up becoming the "shared" policy for the file, used 
when accessed through other VMAs.

So that seems like we're not very keen on having one user of a file set a
policy that would affect other users of the file?

For VMs in QEMU we really want to configure the policy once in the main 
process and have all other processes (e.g., vhost-user) not worry about 
that when they mmap() guest memory.

With shmem this works by "shared policy" design (below). For hugetlb, we 
rely on the fact that mbind()+MADV_POPULATE_WRITE allows us to 
preallocate NUMA-aware. So with hugetlb we really preallocate all guest 
RAM to guarantee the NUMA placement.

It would not be the worst idea to have a clean interface to configure 
file-range policies instead of having this weird shmem mbind() behavior 
and the hugetlb hack.

Having that said, other filesystem are rarely used for backing VMs, at 
least in combination with NUMA. So nobody really cared that much for now.

Maybe fbind() would primarily only be useful for in-memory filesystems 
(shmem/hugetlb/guest_memfd).

Now the next paragraph of the manpage says that shmem is different, and
guest_memfd is more like shmem than a regular file.

My conclusion from that is that fbind() might be too broad and we don't want
this for actual filesystem-backed files? And if it's limited to guest_memfd,
it shouldn't be an fbind()?

I was just once again diving into how mbind() on shmem is handled. And 
in fact, mbind() will call vma->set_policy() to update the per 
file-range policy. I wonder why we didn't do the same for hugetlb ... 
but of course, hugetlb must be special in any possible way.

Not saying it's the best idea, but as we are talking about mmap support 
of guest_memfd (only allowing to fault in shared/faultable pages), one 
*could* look into implementing mbind()+vma->set_policy() for guest_memfd 
similar to how shmem handles it.

It would require a (temporary) dummy VMA in the worst case (all private 
memory).

It sounds a bit weird, though, to require a VMA to configure this, 
though. But at least it's similar to what shmem does ...

--
Cheers,

David / dhildenb