Re: passing FDs across domains

On Wed, Mar 20, 2019 at 9:11 PM Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote:
>
>   Hi,
>
> > While Tomeu and others have been mostly considering the graphics use
> > cases here (2D, 3D, Wayland), we've been looking into a wider range of
> > use cases involving passing buffers between guest and host:
> >  - video decoder/encoder (the general idea of the hardware may be
> > inferred from corresponding V4L2 APIs modeling those:
> > https://hverkuil.home.xs4all.nl/codec-api/uapi/v4l/dev-mem2mem.html),
>
> Camera was mentioned too.
>

Right, forgot to mention it.

Actually, for cameras it gets complicated even if we put the buffer
sharing aside. The key point is that on modern systems, which need
more advanced camera capabilities than a simple UVC webcam, the camera
is in fact a whole subsystem of hardware components, i.e. sensors,
lenses, a raw capture interface and one or more ISPs. Currently the
only relatively successful attempt at standardizing control over such
a subsystem is the Android Camera HALv3 API, which is a fairly complex
userspace interface. Getting feature parity with it, which is crucial
for the use cases Chrome OS is targeting, is going to require quite a
sophisticated interface between the host and the guest.

> >  - generic IPC pass-through (e.g. Mojo, used extensively in Chromium),
>
> What do you need here?
>

There are various use cases that make sense only for one particular
combination of host and guest, so creating dedicated virtio interfaces
for them would not only be overkill, but also a relatively big and not
very useful effort.

I can't really name them here at the moment, due to their closed
nature. However, let's say we have a service in the Chrome OS host
that communicates with its clients, other Chrome OS processes, using
Mojo IPC. Mojo is yet another IPC mechanism designed to work over a
Unix socket, relying on file descriptor passing (SCM_RIGHTS) to pass
various platform handles (e.g. DMA-bufs) around. The clients exchange
DMA-bufs with the service.

All the components involved (the service and its clients) are heavily
specific to Chrome OS. Now we also need to put a client of that
service inside our VM guests. The natural solution would be to let the
client talk to the service over vsock, extended with support for
SCM_RIGHTS.
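
A minimal sketch of what the guest-side client would do, assuming
AF_VSOCK grew the same SCM_RIGHTS support that AF_UNIX has today
(error handling omitted):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Pass one fd (e.g. a DMA-buf) as ancillary data on a connected socket. */
  static int send_fd(int sock, int fd_to_pass)
  {
          char data = 'F';
          struct iovec iov = { .iov_base = &data, .iov_len = 1 };
          union {
                  struct cmsghdr align;
                  char buf[CMSG_SPACE(sizeof(int))];
          } u;
          struct msghdr msg = {
                  .msg_iov = &iov,
                  .msg_iovlen = 1,
                  .msg_control = u.buf,
                  .msg_controllen = sizeof(u.buf),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

          return sendmsg(sock, &msg, 0);
  }

Ideally the only difference from today's host-side clients would be
the address family of the socket they connect to.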

> >  - crypto hardware accelerators.
>
> Note: there is virtio-crypto.
>

Thanks, that's a useful pointer.

One more aspect is that the nature of some data may require that only
the host can access it in decrypted form. From my quick look at
virtio-crypto, it seems to always use guest memory for the input and
output of the algorithm, which wouldn't let us achieve that. Of course
that's something that could be added as an extension in a future
version of the interface, for example by allowing virtio-crypto to use
some kind of resource handles to refer to buffers.
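
Something along these lines, purely as an illustration; none of the
names below exist in the virtio-crypto spec, they are made up here:

  #include <linux/types.h>

  /* Made-up example of a request field referring to a host-owned
   * buffer (e.g. a dma-buf) instead of guest memory; not part of the
   * current virtio-crypto specification. */
  struct virtio_crypto_resource_ref {
          __le32 resource_handle;
          __le32 reserved;
          __le64 offset;
          __le64 length;
  };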

> > > > 1. Denial of Service.  A mechanism that involves pinning and relies on
> > > > guest cooperation needs to be careful to prevent buggy or malicious
> > > > guests from hogging resources that the host cannot reclaim.
> >
> > I think we're talking about guest pages pinned by the guest kernel,
> > right? The host (VMM) would be still the ultimate owner of all the
> > memory given to the guest (VM), so how would it make it impossible for
> > the host to reclaim that memory? Of course forcefully removing those
> > pages from the guest could possibly crash it, but if it was
> > malicious/buggy, I wouldn't expect it to work correctly anyway.
>
> Guest dma-bufs are pinned in the guest kernel only.
>
> Guest may ask the host to create host dma-bufs (which are pinned on the
> host kernel), in which case the host can allow or deny that depending on
> policy, limits, ...
>

Still, the VMM would be the owner of the memory, fully responsible
for keeping the buffer present (pinned) for the lifetime of the VM. If
for some reason we ran out of memory on the host, the OOM killer could
just kill the VM and those buffers would be reclaimed. (Or one could
have a more elaborate mechanism that gives the guest a chance to
return those buffers and save itself from being killed.)

> > 2) The buffer was allocated entirely by the guest, from regular guest
> > pages. (Currently we don't support this in Chromium OS.)
> >
> > I assume the guest would send a list of pages (physical addresses,
> > pfns?) to share.
>
> Almost.  Not *that* simple, there can be an iommu so we have to use dma
> addresses not ram addresses.  But, yes, this is what virtio-gpu is
> fundamentally doing when creating resources.
>

Agreed.

> > The host should be able to translate those into its
> > own user memory addresses, since they would have been inserted to the
> > guest by the host itself (using KVM_SET_USER_MEMORY_REGION). Then,
> > whoever wants to access such memory on host side would have to import
> > that to the host kernel, using some kind of userptr API, which would
> > pin those pages on the host side.
>
> WIP patches exist to use the new udmabuf driver for turning those guest
> virtio-gpu resources into host dma-bufs.
>

Great, thanks.
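
For reference, I imagine that on the host side this boils down to
something like the sketch below, assuming guest RAM is backed by a
sealed memfd; gpa_to_memfd_offset() stands in for the VMM's own
bookkeeping and is not a real helper:

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/udmabuf.h>

  /* Hypothetical VMM helper mapping a guest physical address to an
   * offset inside the memfd backing guest RAM. */
  extern uint64_t gpa_to_memfd_offset(uint64_t gpa);

  /* Export a guest-resident buffer as a host dma-buf via the udmabuf
   * driver.  udmabuf requires the memfd pages to be sealed against
   * shrinking (F_SEAL_SHRINK). */
  static int export_guest_buffer(int guest_memfd, uint64_t gpa, uint64_t size)
  {
          struct udmabuf_create create = {
                  .memfd  = guest_memfd,
                  .offset = gpa_to_memfd_offset(gpa),
                  .size   = size,
          };
          int devfd = open("/dev/udmabuf", O_RDWR);
          int buf_fd;

          if (devfd < 0)
                  return -1;
          buf_fd = ioctl(devfd, UDMABUF_CREATE, &create);
          close(devfd);
          return buf_fd;  /* dma-buf fd other host processes can import */
  }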

> > With that, even if the guest goes away, corresponding host memory
> > would be still available, given that the above look-up and import
> > procedure is mutually exclusive with guest termination code. Again, a
> > simple mutex protecting the VMM user memory bookkeeping structures
> > should work.
>
> Yes.
>
> > > > On the other hand, the host
> > > > process shouldn't be able to hang the guest either by keeping the
> > > > dma-buf alive.
> >
> > I'm probably missing something. Could you elaborate on how that could
> > happen? (Putting aside any side effects of the guest itself
> > misbehaving, but then it's their fault that something bad happens to
> > it.)
>
> With dma-bufs backed by guest ram pages the guest can't reuse those
> pages for something else as long as the dma-buf exists.
>
> Which isn't much of a problem as long as qemu uses the dma-buf
> internally only (qemu can get a linear mapping of a scattered buffer
> that way).  Qemu can simply cleanup before it acks the resource destroy
> command to the guest.  But when handing out dma-buf file handles to
> other processes this needs some care.
>

Yeah, I don't think we could ack the resource destroy before all the
host processes holding references to the buffer have released them.

But actually, in the cases being considered here, another host process
would only get a reference to such a DMA-buf as a result of the guest
itself passing a reference to that buffer over a corresponding
interface (e.g. video decoder). In that case, the guest driver would
keep a reference internally as well, until it gets a confirmation
through the same interface (here: the video decoder) that the host no
longer needs the buffer.
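
Roughly the usual deferred-release pattern, something like this (all
names made up, just to illustrate the ordering):

  #include <stdbool.h>

  /* The destroy ack to the guest is sent only once the last host-side
   * user has dropped its reference to the exported buffer. */
  struct exported_buf {
          int refcount;
          bool destroy_requested;  /* guest already asked to destroy it */
  };

  extern void send_destroy_ack(struct exported_buf *buf);  /* hypothetical */

  static void exported_buf_put(struct exported_buf *buf)
  {
          if (--buf->refcount == 0 && buf->destroy_requested)
                  send_destroy_ack(buf);
  }

  static void handle_resource_destroy(struct exported_buf *buf)
  {
          buf->destroy_requested = true;
          exported_buf_put(buf);  /* drop the reference held since export */
  }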

> > > > 2. dma-buf passing only works in the guest->host direction.  It
> > > > doesn't work host<->host or guest<->guest (if we decide to support it
> > > > in the future) because a guest cannot "see" memory ranges from the
> > > > host or other guests.  I don't like this asymmetry but I guess we
> > > > could live with it.
> >
> > Why? The VMM given a host-side DMA-buf FD could mmap it and insert
> > into the guest address space, using KVM_SET_USER_MEMORY_REGION or some
> > other means (PCI memory bar?).
>
> Yes, pci memory bar, for address space management reasons.
>
> > > > I wonder if it would be cleaner to extend virtio-gpu for this use case
> > > > instead of trying to pass buffers over AF_VSOCK.
> >
> > We've been having some discussion about those other use cases and I
> > believe there are several advantages of not tying this to virtio-gpu,
> > e.g.
> >
> > 1) vsock with SCM_RIGHTS would keep the POSIX socket API, for higher
> > reusability of the userspace,
>
> I have my doubts that it is possible to make this fully transparent.
> For starters only certain kinds of file handles will work.
>

Agreed. It would only work for the subset of file descriptor types we
could support. That would still be enough for the protocols where we
know (or control) what kinds of handles are being exchanged.
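
The check itself would be cheap on the guest kernel side; a rough
sketch using existing in-kernel helpers (HANDLE_TYPE_DMABUF is a
made-up constant):

  #include <linux/dma-buf.h>
  #include <linux/err.h>
  #include <linux/errno.h>

  #define HANDLE_TYPE_DMABUF 1  /* made up for this sketch */

  /* Accept only descriptor types the driver knows how to translate. */
  static int classify_fd(int fd)
  {
          struct dma_buf *dmabuf = dma_buf_get(fd);

          if (!IS_ERR(dmabuf)) {
                  dma_buf_put(dmabuf);
                  return HANDLE_TYPE_DMABUF;
          }
          /* ...similar checks for pipes, memfds, etc... */
          return -EOPNOTSUPP;
  }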

> > 2) no need for different userspace proxies for different protocols
> > (mojo, wayland) and different transports (vsock, virtio-gpu),
>
> I'm not convinced it is that easy.  wayland clients for example expect
> they can pass sysv shm handles.
>
> One problem with sysv shm is that you can resize buffers.  Which in turn
> is the reason why we have memfd with sealing these days.

Indeed shm is a bit problematic. However, passing file descriptors of
pipe-like objects or regular files could be implemented with a
reasonable amount of effort, if some performance trade-offs are
acceptable.
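
(As a side note, for regular files and memfds the resize problem at
least has a known answer with sealing; a quick sketch of what we could
require from the buffer creator:)

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Create a shared buffer as a sealed memfd, so the receiving side
   * can rely on the size never changing underneath it. */
  static int create_sealed_buffer(size_t size)
  {
          int fd = memfd_create("shared-buffer", MFD_ALLOW_SEALING);

          if (fd < 0)
                  return -1;
          if (ftruncate(fd, size) < 0 ||
              fcntl(fd, F_ADD_SEALS,
                    F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) < 0) {
                  close(fd);
                  return -1;
          }
          return fd;
  }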

>
> > 3) no tying of other capabilities (e.g. video encoder, Mojo IPC
> > pass-through, crypto) to the existence of virtio-gpu in the system,
>
> OK.
>
> > 4) simpler from the virtio interfaces point of view - no need for all
> > the virtio-gpu complexity for a simple IPC pass-through,
>
> OK.
>
> > 5) could be a foundation for implementing sharing of other objects,
> > e.g. guest shared memory by looking up guest pages in the host and
> > constructing a shared memory object out of it (useful for shm-based
> > wayland clients that allocate shm memory themselves).
>
> For one: see the sysv shm note above.
>
> Second: How do you want create a sysv shm object on the host side?
>

Yep, shm is problematic indeed. It wasn't a good example. However,
for protocols like Wayland, shm buffers are problematic in general,
even if we stay entirely in host land. It's not the kind of memory
that GPUs or other hardware accelerators prefer (or are even able) to
work with.

I don't think it's impossible, though. I'll give it a bit more
thought, because it's an interesting problem.

> Third: Any plan for passing virtio-gpu resources to the host side when
> running wayland over virtio-vsock?  With dumb buffers it's probably not
> much of a problem, you can grab a list of pages and run with it.  But
> for virgl-rendered resources (where the rendered data is stored in a
> host texture) I can't see how that will work without copying around the
> data.

I think it could work the same way as with the virtio-gpu window
system pipe being proposed in another thread. The guest vsock driver
would figure out that the FD the userspace is trying to pass points to
a virtio-gpu resource, convert it into some kind of resource handle
(or descriptor) and pass that to the host. The host vsock
implementation would then resolve the resource handle (descriptor)
into an object that can be represented as a host file descriptor
(a DMA-buf?). I'd expect that buffers used for Wayland surfaces would
need to be more than just regular GL(ES) textures, since the
compositor and virglrenderer would normally be different processes,
with the former having no idea of the latter's textures.
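
To make that a bit more concrete, the descriptor travelling over the
transport could be as small as something like this (all the names and
the exact layout are made up for illustration, nothing like this
exists today):

  #include <linux/types.h>

  /* Illustrative wire format for the token that would replace the fd
   * in the control message of the hypothetical vsock SCM_RIGHTS
   * extension. */
  struct vsock_fd_desc {
          __le32 type;    /* e.g. virtio-gpu resource vs. pipe vs. file */
          __le32 flags;
          __le64 handle;  /* virtio-gpu resource id or similar token */
  };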

By the way, are you perhaps planning to visit the Open Source Summit
Japan in July [1]? If so, we could try to arrange some discussion
there, on anything we don't manage to settle before. :)

[1] https://events.linuxfoundation.org/events/open-source-summit-japan-2019/

Best regards,
Tomasz


