On 11/8/24 18:31, Paolo Bonzini wrote:
> On 11/7/24 16:10, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>> The folio allocation path from guest_memfd typically looks like this...
>>>
>>> kvm_gmem_get_folio
>>>   filemap_grab_folio
>>>     __filemap_get_folio
>>>       filemap_alloc_folio
>>>         __folio_alloc_node_noprof
>>>           -> goes to the buddy allocator
>>>
>>> Hence, I am trying to have a version of filemap_alloc_folio() that
>>> takes an mpol.
>>
>> It only takes that path if cpuset_do_page_mem_spread() is true. Is the
>> real problem that you're trying to solve that cpusets are being used
>> incorrectly?
>
> If it's false it's not very different, it goes to alloc_pages_noprof().
> Then it respects the process's policy, but the policy is not
> customizable without mucking with state that is global to the process.
>
> Taking a step back: the problem is that a VM can be configured to have
> multiple guest-side NUMA nodes, each of which will pick memory from the
> right NUMA node in the host. Without a per-file operation it's not
> possible to do this on guest_memfd. The discussion was whether to use
> ioctl() or a new system call. The discussion ended with the idea of
> posting a *proposal* asking for *comments* as to whether the system call
> would be useful in general beyond KVM.
>
> Commenting on the system call itself I am not sure I like the
> file_operations entry, though I understand that it's the simplest way to
> implement this in an RFC series. It's a bit surprising that fbind() is
> a total no-op for everything except KVM's guest_memfd.
>
> Maybe whatever you pass to fbind() could be stored in the struct file *,
> and used as the default when creating VMAs; as if every mmap() was
> followed by an mbind(), except that it also does the right thing with
> MAP_POPULATE for example. Or maybe that's a horrible idea?

mbind() manpage has this:

    The specified policy will be ignored for any MAP_SHARED mappings
    in the specified memory range. Rather the pages will be allocated
    according to the memory policy of the thread that caused the page
    to be allocated. Again, this may not be the thread that called
    mbind().

So that seems like we're not very keen on having one user of a file set a
policy that would affect other users of the file?

Now the next paragraph of the manpage says that shmem is different, and
guest_memfd is more like shmem than a regular file.

My conclusion from that is that fbind() might be too broad and we don't
want this for actual filesystem-backed files? And if it's limited to
guest_memfd, it shouldn't be an fbind()?

> Adding linux-api to get input; original thread is at
> https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@xxxxxxx/.
>
> Paolo
>
>> Backing up, it seems like you want to make a change to the page cache,
>> you've had a long discussion with people who aren't the page cache
>> maintainer, and you all understand the pros and cons of everything,
>> and here you are dumping a solution on me without talking to me, even
>> though I was at Plumbers, you didn't find me to tell me I needed to go
>> to your talk.
>>
>> So you haven't explained a damned thing to me, and I'm annoyed at you.
>> Do better. Starting with your cover letter.