Re: [RESEND RFC PATCH 0/5] Remote mapping

Andy Lutomirski <luto@xxxxxxxxxx> · Mon, 7 Sep 2020 13:43:48 -0700

On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner
<christian.brauner@xxxxxxxxxx> wrote:
>
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > This patchset adds support for the remote mapping feature.
> > Remote mapping, as its name suggests, is a means for transparent and
> > zero-copy access of a remote process' address space.
> > access of a remote process' address space.
> >
> > The feature was designed according to a specification suggested by
> > Paolo Bonzini:
> > >> The proposed API is a new pidfd system call, through which the parent
> > >> can map portions of its virtual address space into a file descriptor
> > >> and then pass that file descriptor to a child.
> > >>
> > >> This should be:
> > >>
> > >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> > >> better way to do PTRACE_{PEEK,POKE}DATA
>
> In all honesty, that sentence made me a bit uneasy as it reads like this
> is implemented on top of pidfds because it makes it more likely to go
> upstream not because it is the right design. To be clear, I'm not
> implying any sort of malicious intent on your part but I would suggest
> to phrase this a little better. :)

I thought about this whole thing some more, and here are some thoughts.

First, I was nervous about two things.  One was faulting in pages from
the wrong context.  (When a normal page fault or KVM faults in a page,
the mm is loaded.  (In the KVM case, the mm is sort of not loaded when
the actual fault happens, but the mm is loaded when the fault is
handled, I think.  Maybe there are workqueues involved and I'm wrong.)
 When a remote mapping faults in a page, the mm is *not* loaded.)
This ought not to be a problem, though -- get_user_pages_remote() also
faults in pages from a non-current mm, and that's at least supposed to
work correctly, so maybe this is okay.

Second is recursion.  I think this is a genuine problem.

And I think that tying this to pidfds is the wrong approach.  In fact,
tying it to processes at all seems wrong.  There is a lot of demand
for various forms of memory isolation in which memory is mapped only
by its intended user.  Using something tied to a process mm gets in
the way of this in the same way that KVM's current mapping model gets
in the way.

All that being said, I think the whole idea of making fancy address
spaces composed from other mappable objects is neat and possibly quite
useful.  And, if you squint a bit, this is a lot like what KVM does
today.

So I suggest something that may be more generally useful as an
alternative.  This is a sketch and very subject to bikeshedding:

Create an empty address space:

int address_space_create(int flags, etc);

Map an fd into an address space:

int address_space_mmap(int asfd, int fd_to_map, offset, size, prot,
...);  /* might run out of args here */

Unmap from an address space:

int address_space_munmap(int asfd, unsigned long addr, unsigned long len);

Stick an address space into KVM:

ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);  /* or similar */

Maybe some day allow mapping an address space into a process.

mmap(..., asfd, ...);

And at least for now, there's a rule that an address space that is
address_space_mmapped into an address space is disallowed.

Maybe some day we also allow mremap(), madvise(), etc.  And maybe some
day we allow creating a special address_space that represents a real
process's address space.

Under the hood, an address_space could own an mm_struct that is not
used by any tasks.  And we could have special memfds that are bound to
a VM such that all you can do with them is stick them into an
address_space and map that address_space into the VM in question.  For
this to work, we would want a special vm_operation for mapping into a
VM.

What do you all think?  Is this useful?  Does it solve your problems?
Is it a good approach going forward?