On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote: > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote: > > This patchset adds support for the remote mapping feature. > > Remote mapping, as its name suggests, is a means for transparent and > > zero-copy access of a remote process' address space. > > access of a remote process' address space. > > > > The feature was designed according to a specification suggested by > > Paolo Bonzini: > > >> The proposed API is a new pidfd system call, through which the parent > > >> can map portions of its virtual address space into a file descriptor > > >> and then pass that file descriptor to a child. > > >> > > >> This should be: > > >> > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a > > >> better way to do PTRACE_{PEEK,POKE}DATA > > In all honesty, that sentence made me a bit uneasy as it reads like this > is implemented on top of pidfds because it makes it more likely to go > upstream not because it is the right design. To be clear, I'm not > implying any sort of malicious intent on your part but I would suggest > to phrase this a little better. :) I thought about this whole thing some more, and here are some thoughts. First, I was nervous about two things. One was faulting in pages from the wrong context. (When a normal page fault or KVM faults in a page, the mm is loaded. (In the KVM case, the mm is sort of not loaded when the actual fault happens, but the mm is loaded when the fault is handled, I think. Maybe there are workqueues involved and I'm wrong.) When a remote mapping faults in a page, the mm is *not* loaded.) This ought not to be a problem, though -- get_user_pages_remote() also faults in pages from a non-current mm, and that's at least supposed to work correctly, so maybe this is okay. Second is recursion. I think this is a genuine problem. And I think that tying this to pidfds is the wrong approach. In fact, tying it to processes at all seems wrong. There is a lot of demand for various forms of memory isolation in which memory is mapped only by its intended user. Using something tied to a process mm gets in the way of this in the same way that KVM's current mapping model gets in the way. All that being said, I think the whole idea of making fancy address spaces composed from other mappable objects is neat and possibly quite useful. And, if you squint a bit, this is a lot like what KVM does today. So I suggest something that may be more generally useful as an alternative. This is a sketch and very subject to bikeshedding: Create an empty address space: int address_space_create(int flags, etc); Map an fd into an address space: int address_space_mmap(int asfd, int fd_to_map, offset, size, prot, ...); /* might run out of args here */ Unmap from an address space: int address_space_munmap(int asfd, unsigned long addr, unsigned long len); Stick an address space into KVM: ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd); /* or similar */ Maybe some day allow mapping an address space into a process. mmap(..., asfd, ...); And at least for now, there's a rule that an address space that is address_space_mmapped into an address space is disallowed. Maybe some day we also allow mremap(), madvise(), etc. And maybe some day we allow creating a special address_space that represents a real process's address space. Under the hood, an address_space could own an mm_struct that is not used by any tasks. And we could have special memfds that are bound to a VM such that all you can do with them is stick them into an address_space and map that address_space into the VM in question. For this to work, we would want a special vm_operation for mapping into a VM. What do you all think? Is this useful? Does it solve your problems? Is it a good approach going forward?