On Mon, Sep 07, 2020 at 01:43:48PM -0700, Andy Lutomirski wrote: > On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner > <christian.brauner@xxxxxxxxxx> wrote: > > > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote: > > > This patchset adds support for the remote mapping feature. > > > Remote mapping, as its name suggests, is a means for transparent and > > > zero-copy access of a remote process' address space. > > > access of a remote process' address space. > > > > > > The feature was designed according to a specification suggested by > > > Paolo Bonzini: > > > >> The proposed API is a new pidfd system call, through which the parent > > > >> can map portions of its virtual address space into a file descriptor > > > >> and then pass that file descriptor to a child. > > > >> > > > >> This should be: > > > >> > > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a > > > >> better way to do PTRACE_{PEEK,POKE}DATA > > > > In all honesty, that sentence made me a bit uneasy as it reads like this > > is implemented on top of pidfds because it makes it more likely to go > > upstream not because it is the right design. To be clear, I'm not > > implying any sort of malicious intent on your part but I would suggest > > to phrase this a little better. :) > > > I thought about this whole thing some more, and here are some thoughts. > > First, I was nervous about two things. One was faulting in pages from > the wrong context. (When a normal page fault or KVM faults in a page, > the mm is loaded. (In the KVM case, the mm is sort of not loaded when > the actual fault happens, but the mm is loaded when the fault is > handled, I think. Maybe there are workqueues involved and I'm wrong.) > When a remote mapping faults in a page, the mm is *not* loaded.) > This ought not to be a problem, though -- get_user_pages_remote() also > faults in pages from a non-current mm, and that's at least supposed to > work correctly, so maybe this is okay. > > Second is recursion. I think this is a genuine problem. > > And I think that tying this to pidfds is the wrong approach. In fact, > tying it to processes at all seems wrong. There is a lot of demand > for various forms of memory isolation in which memory is mapped only > by its intended user. Using something tied to a process mm gets in > the way of this in the same way that KVM's current mapping model gets > in the way. > > All that being said, I think the whole idea of making fancy address > spaces composed from other mappable objects is neat and possibly quite > useful. And, if you squint a bit, this is a lot like what KVM does > today. > > So I suggest something that may be more generally useful as an > alternative. This is a sketch and very subject to bikeshedding: > > Create an empty address space: > > int address_space_create(int flags, etc); > > Map an fd into an address space: > > int address_space_mmap(int asfd, int fd_to_map, offset, size, prot, > ...); /* might run out of args here */ > > Unmap from an address space: > > int address_space_munmap(int asfd, unsigned long addr, unsigned long len); > > Stick an address space into KVM: > > ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd); /* or similar */ > > Maybe some day allow mapping an address space into a process. > > mmap(..., asfd, ...); > > > And at least for now, there's a rule that an address space that is > address_space_mmapped into an address space is disallowed. > > > Maybe some day we also allow mremap(), madvise(), etc. And maybe some > day we allow creating a special address_space that represents a real > process's address space. > > > Under the hood, an address_space could own an mm_struct that is not > used by any tasks. And we could have special memfds that are bound to > a VM such that all you can do with them is stick them into an > address_space and map that address_space into the VM in question. For > this to work, we would want a special vm_operation for mapping into a > VM. > > > What do you all think? Is this useful? Does it solve your problems? > Is it a good approach going forward? Hi Adalbert and Andy, As everyone continues to discuss how the mechanism should look, I want to share two use cases for something like this. Let me know if you would like more detail on these use cases. They requirement in both cases is that process A can map a virtual memory range from process B so that mmap/munmap operations within the memory range in process B also affect process A. An enforcing vIOMMU for vhost-user and vfio-user ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ vhost-user, vfio-user, and other out-of-process device emulation interfaces need a way for the virtual machine manager (VMM) to enforce the vIOMMU mappings on the device emulation process. The VMM emulates the vIOMMU and only wants to expose a subset of memory to the device emulation process. This subset can change as the guest programs the vIOMMU. Today the VMM passes all guest RAM fds to the device emulation process and has no way of restricting access or revoking it later. The new mechanism would allow the VMM to add/remove mappings so that the device emulation process can only access ranges of memory programmed by the guest vIOMMU. Accesses to unmapped addresses would raise a signal. Accelerating the virtio-fs DAX window ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The virtiofsd vhost-user process handles guest file map/unmap messages. The map/unmap messages allow the guest to map ranges of files into its memory space. The guest kernel then uses DAX to access the file pages without copying their contents into the guest page cache and mmap MAP_SHARED is coherent when guests access the same file. Today virtiofsd sends a message to the VMM over a UNIX domain socket asking for an mmap/munmap. The VMM must perform the mapping on behalf of virtiofsd. This communication and file descriptor passing is clumsy and slow. The new mechanism would allow virtiofsd to map/unmap without extra coordination with the VMM. The VMM only needs to perform an initial mmap of the DAX window so that kvm.ko can resolve page faults to that region. Stefan
Attachment:
signature.asc
Description: PGP signature