Re: [kvm-devel] [PATCH 00/10] PV-IO v3

Rusty Russell <rusty@xxxxxxxxxxxxxxx> · Fri, 17 Aug 2007 11:25:53 +1000

On Thu, 2007-08-16 at 19:13 -0400, Gregory Haskins wrote:
> Here is the v3 release of the patch series for a generalized PV-IO
> infrastructure.  It has v2 plus the following changes:

Hi Gregory,

	This is a lot of code.  I'm having trouble taking it all in, TBH.  It
might help me if we could go back to the basic transport
implementation questions.

Transport has several parts.  What the hypervisor knows about (usually
shared memory and some interrupt mechanism and possibly "DMA") and what
is convention between users (eg. ringbuffer layouts).  Whether it's 1:1
or n-way (if 1:1, is it symmetrical?).  Whether it has to be host <->
guest, or can be inter-guest.  Whether it requires trust between the
sides.

My personal thoughts are that we should be aiming for 1:1 untrusting.  I
like N-way, but it adds complexity.  And not having inter-guest is just
poor form (and putting it in later is impossible, as we'll see).

It seems that a shared-memory "ring-buffer of descriptors" is the
simplest implementation.  But there are two problems with a simple
descriptor ring:

        1) A ring buffer doesn't work well for things which process
        out-of-order, such as a block device.
        2) We either need huge descriptors or some chaining mechanism to
        handle scatter-gather.

So we end up with an array of descriptors with next pointers, and two
ring buffers which refer to those descriptors: one for what descriptors
are pending, and one for what descriptors have been used (by the other
end).
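
To make that concrete, here is a rough sketch of the structure (not from
any existing code; NUM_DESCS, DESC_END and all the field and function
names are made up for illustration).  Chains give us scatter-gather, and
because the "used" ring carries head indices rather than positions,
completions can come back in any order:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DESCS 8          /* hypothetical table size (power of two) */
#define DESC_END  NUM_DESCS  /* hypothetical "end of chain" marker */

/* One scatter-gather element.  Field names are illustrative only. */
struct desc {
	uint64_t addr;	/* where the buffer lives */
	uint32_t len;	/* buffer length in bytes */
	uint32_t next;	/* index of next descriptor, or DESC_END */
};

/* A ring of chain heads.  idx counts entries ever added. */
struct ring {
	uint32_t idx;
	uint32_t head[NUM_DESCS];
};

struct transport {
	struct desc desc[NUM_DESCS];
	struct ring pending;	/* filled by the producer */
	struct ring used;	/* filled by the consumer, in any order */
};

/* Link descriptors first..first+n-1 into one chain; returns the head. */
static uint32_t chain_descs(struct transport *t, uint32_t first, uint32_t n)
{
	assert(n > 0 && first + n <= NUM_DESCS);
	for (uint32_t i = 0; i < n; i++)
		t->desc[first + i].next = (i + 1 < n) ? first + i + 1 : DESC_END;
	return first;
}

/* Append a chain head to a ring. */
static void ring_add(struct ring *r, uint32_t head)
{
	r->head[r->idx % NUM_DESCS] = head;
	r->idx++;
}

/* Total bytes covered by the chain starting at head. */
static uint32_t chain_len(const struct transport *t, uint32_t head)
{
	uint32_t total = 0;

	for (uint32_t i = head; i != DESC_END; i = t->desc[i].next)
		total += t->desc[i].len;
	return total;
}
```

A block device could, for example, add chain 2 to "used" before chain 0
even though chain 0 was made pending first.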

This is sufficient for guest<->host, but care must be taken for guest
<-> guest.  Let's dig down:

Consider a transport from A -> B.  A populates the descriptor entries
corresponding to its sg, then puts the head descriptor entry in the
"pending" ring buffer and sends B an interrupt.  B sees the new pending
entry, reads the descriptors, does the operation and reads or writes
into the memory pointed to by the descriptors.  It then updates the
"used" ring buffer and sends A an interrupt.
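
The exchange above might look something like this, simulated inside one
process.  This is only a sketch: the irq_to_a/irq_to_b flags stand in
for real interrupts, b_seen/a_seen are private counters I have invented,
and a real implementation would need memory barriers between filling in
a ring entry and bumping its index:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_DESCS 4          /* hypothetical table size */
#define DESC_END  NUM_DESCS  /* hypothetical "end of chain" marker */

struct desc { void *addr; uint32_t len; uint32_t next; };
struct ring { uint32_t idx; uint32_t head[NUM_DESCS]; };

struct chan {
	struct desc desc[NUM_DESCS];
	struct ring pending, used;
	uint32_t b_seen;	/* pending entries B has consumed */
	uint32_t a_seen;	/* used entries A has reaped */
	int irq_to_a, irq_to_b;	/* stand-ins for interrupts */
};

/* A: publish a chain head in "pending" and kick B. */
static void a_post(struct chan *c, uint32_t head)
{
	assert(head < NUM_DESCS);
	c->pending.head[c->pending.idx % NUM_DESCS] = head;
	c->pending.idx++;
	c->irq_to_b = 1;
}

/*
 * B: handle every new pending chain.  The "operation" here is just
 * filling each buffer with a byte.  Returns the number of chains done.
 */
static int b_service(struct chan *c, unsigned char fill)
{
	int handled = 0;

	c->irq_to_b = 0;
	while (c->b_seen != c->pending.idx) {
		uint32_t head = c->pending.head[c->b_seen % NUM_DESCS];

		c->b_seen++;
		for (uint32_t i = head; i != DESC_END; i = c->desc[i].next)
			memset(c->desc[i].addr, fill, c->desc[i].len);
		c->used.head[c->used.idx % NUM_DESCS] = head;
		c->used.idx++;
		handled++;
	}
	if (handled)
		c->irq_to_a = 1;
	return handled;
}

/* A: fetch the next completed chain head.  Returns 0 if none is ready. */
static int a_reap(struct chan *c, uint32_t *head)
{
	if (c->a_seen == c->used.idx)
		return 0;
	*head = c->used.head[c->a_seen % NUM_DESCS];
	c->a_seen++;
	return 1;
}
```

Note that B dereferences A's buffer addresses directly here, which is
only acceptable when B is trusted.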

Now, if B is untrusted, this is more difficult.  It needs to read the
descriptor entries and the "pending" ring buffer, and write to the
"used" ring buffer.  We can use page protection to share these if we
arrange things carefully, like so:

        struct desc_pages
        {
        	/* Page of descriptors. */
        	struct lguest_desc desc[NUM_DESCS];

        	/* Next page: how we tell other side what buffers are available. */
        	unsigned int avail_idx;
        	unsigned int available[NUM_DESCS];
        	char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];

        	/* Third page: how other side tells us what's used. */
        	unsigned int used_idx;
        	struct lguest_used used[NUM_DESCS];
        };
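
The layout only works if each part really does start on a page boundary,
which can be checked at compile time.  The sketch below assumes 4096-byte
pages, a 4-byte unsigned int, and a 16-byte descriptor; the fields of
lguest_desc and lguest_used, and the peer_access() helper, are guesses
for illustration and not the real definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096	/* assumed page size */

/* Hypothetical 16-byte descriptor so that the table fills one page. */
struct lguest_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Hypothetical "used" entry: which chain, and how much was written. */
struct lguest_used {
	unsigned int id;
	unsigned int len;
};

#define NUM_DESCS (PAGE_SIZE / sizeof(struct lguest_desc))

struct desc_pages {
	/* Page of descriptors. */
	struct lguest_desc desc[NUM_DESCS];

	/* Next page: how we tell other side what buffers are available. */
	unsigned int avail_idx;
	unsigned int available[NUM_DESCS];
	char pad[PAGE_SIZE - (NUM_DESCS + 1) * sizeof(unsigned int)];

	/* Third page: how other side tells us what's used. */
	unsigned int used_idx;
	struct lguest_used used[NUM_DESCS];
};

_Static_assert(sizeof(struct lguest_desc) * NUM_DESCS == PAGE_SIZE,
	       "descriptor table must fill exactly one page");
_Static_assert(offsetof(struct desc_pages, avail_idx) == PAGE_SIZE,
	       "available ring must start on the second page");
_Static_assert(offsetof(struct desc_pages, used_idx) == 2 * PAGE_SIZE,
	       "used ring must start on the third page");

enum access { READ_ONLY, READ_WRITE };

/* Access the other side would get for a byte offset into desc_pages. */
static enum access peer_access(size_t offset)
{
	return offset < offsetof(struct desc_pages, used_idx)
		? READ_ONLY : READ_WRITE;
}
```

So the first two pages would be mapped read-only into the other side and
only the third page writable.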

But we still have the problem of an untrusted B having to read/write A's
memory pointed to by A's descriptors.  At this point, my preferred
solution so far is as follows (note: have not implemented this!):

(1) have the hypervisor be aware of the descriptor page format, location
and which guest can access it.
(2) have the descriptors themselves contain a type (read/write) and a
valid bit.
(3) have a "DMA" hypercall to copy to/from someone else's descriptors.
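
As a very rough illustration of (1)-(3), the hypervisor side of such a
hypercall might do checks along these lines.  Again, this is not
implemented anywhere: hcall_dma(), the DESC_VALID/DESC_WRITE flags and
the rule that a write-type buffer cannot be read back are all
assumptions of this sketch, and the pointers stand in for guest
addresses the hypervisor would have to translate and validate:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_DESCS  4	/* hypothetical table size */
#define DESC_VALID 0x1	/* hypothetical: descriptor is live */
#define DESC_WRITE 0x2	/* hypothetical: peer may write (else read-only) */

struct desc {
	void *addr;	/* owner's buffer */
	uint32_t len;
	uint32_t flags;
};

enum dma_dir { DMA_FROM_DESC, DMA_TO_DESC };

/*
 * Copy between the caller's local buffer and the buffer named by the
 * owner's descriptor idx.  Returns bytes copied, or -1 if the
 * descriptor is out of range, not valid, or of the wrong type.
 */
static long hcall_dma(const struct desc *table, uint32_t idx,
		      enum dma_dir dir, void *local, uint32_t len)
{
	const struct desc *d;
	int writable;

	assert(table != NULL && local != NULL);
	if (idx >= NUM_DESCS)
		return -1;
	d = &table[idx];
	if (!(d->flags & DESC_VALID))
		return -1;

	writable = (d->flags & DESC_WRITE) != 0;
	if (dir == DMA_TO_DESC && !writable)
		return -1;
	if (dir == DMA_FROM_DESC && writable)
		return -1;

	/* Never copy beyond what the owner offered. */
	if (len > d->len)
		len = d->len;
	if (dir == DMA_TO_DESC)
		memcpy(d->addr, local, len);
	else
		memcpy(local, d->addr, len);
	return (long)len;
}
```

The untrusted side never sees the owner's addresses; it only names a
descriptor index, and the copy is bounded by that descriptor's length.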

Note that this means we do a copy for the untrusted case which doesn't
exist for the trusted case.  In theory the hypervisor could do some
tricky copy-on-write page-sharing for very large well-aligned buffers,
but it remains to be seen if that is actually useful.

Sorry for the long mail, but I really want to get the mechanism correct.

Cheers,
Rusty.

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/virtualization