Hi Avi and Anthony,
Sorry for the top-reply, but we haven't discussed this aspect here
before.
I've been thinking about how to implement interrupts. As far as I can
tell, Unix domain sockets in Qemu/KVM are used point-to-point, with one
VM acting as the server by specifying "server" along with the unix:
option. That works simply enough for two VMs, but I'm not sure how it
extends to multiple VMs. How would a server VM know how many clients to
wait for? How can messages then be multicast or broadcast? Is a
separate "interrupt server" necessary?
Thanks,
Cam
On 1-Apr-09, at 12:52 PM, Anthony Liguori wrote:
Avi Kivity wrote:
Anthony Liguori wrote:
Hi Cam,
I would suggest two design changes to make here. The first is
that I think you should use virtio.
I disagree with this. While virtio is excellent at exporting guest
memory, it isn't so good at importing another guest's memory.
First we need to separate static memory sharing and dynamic memory
sharing. Static memory sharing has to be configured on start up. I
think in practice, static memory sharing is not terribly interesting
except for maybe embedded environments.
Dynamic memory sharing requires bidirectional communication in
order to establish mappings and tear down mappings. You'll
eventually recreate virtio once you've implemented this
communication mechanism.
The second is that I think instead of relying on mapping in
device memory to the guest, you should have the guest allocate
its own memory to dedicate to sharing.
That's not what you describe below. You're having the guest
allocate parts of its address space that happen to be used by RAM,
and overlaying those parts with the shared memory.
But from the guest's perspective, its RAM is being used for memory
sharing.
If you're clever, you could start a guest with -mem-path and then
use this mechanism to map a portion of one guest's memory into
another guest without either guest ever knowing who "owns" the
memory and with exactly the same driver on both.
Right now, you've got a bit of a hole in your implementation
because you only support files that are powers-of-two in size even
though that's not documented/enforced. This is a limitation of
PCI resource regions.
While the BAR needs to be a power of two, I don't think the RAM
backing it needs to be.
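In other words, something like this (just a sketch, all names invented):

/* Sketch: size the BAR as the next power of two above the backing file
 * while remembering the real backing size.  Names are invented, not
 * existing code. */
#include <stdint.h>
#include <sys/stat.h>

static uint64_t next_pow2(uint64_t x)
{
    uint64_t p = 1;

    while (p < x)
        p <<= 1;
    return p;
}

/* backing_size comes from fstat() on the shared file; bar_size is what
 * the PCI BAR advertises.  The tail past backing_size is simply never
 * backed by the file. */
static void size_shared_bar(int shm_fd, uint64_t *backing_size,
                            uint64_t *bar_size)
{
    struct stat st;

    fstat(shm_fd, &st);
    *backing_size = st.st_size;
    *bar_size = next_pow2(*backing_size);
}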
Then you need a side channel to communicate the information to the
guest.
Also, the PCI memory hole is limited in size today, which is going to
put an upper bound on the amount of memory you could ever map into a
guest.
Today. We could easily lift this restriction by supporting 64-bit
BARs. It would probably take only a few lines of code.
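For reference, a 64-bit BAR is just a different type encoding in the
low bits plus a second 32-bit register for the upper half of the base
address, roughly like this (the config-space writer below is a
placeholder, not an existing QEMU function):

/* Rough sketch of how a 64-bit memory BAR is encoded: bit 0 = 0 (memory),
 * bits 2:1 = 10b (64-bit), bit 3 = prefetchable, and the base address
 * spans two consecutive 32-bit BAR registers. */
#include <stdint.h>
#include <stdio.h>

#define BAR_MEM_TYPE_64   0x4
#define BAR_MEM_PREFETCH  0x8

/* Placeholder for the device model's config-space accessor. */
static void write_config_long(uint32_t offset, uint32_t val)
{
    printf("config[0x%02x] = 0x%08x\n", offset, val);
}

static void program_64bit_bar(int bar, uint64_t base)
{
    uint32_t lo = (uint32_t)(base & ~0xfULL) | BAR_MEM_TYPE_64 | BAR_MEM_PREFETCH;
    uint32_t hi = (uint32_t)(base >> 32);

    write_config_long(0x10 + 4 * bar, lo);         /* BARn: low half    */
    write_config_long(0x10 + 4 * (bar + 1), hi);   /* BARn+1: high half */
}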
Since you're also using qemu_ram_alloc(), it makes hotplug unworkable,
because qemu_ram_alloc() is a static allocation from a contiguous heap.
We need to fix this anyway, for memory hotplug.
It's going to be hard to "fix" with TCG.
If you used virtio, what you could do is provide a ring queue that
was used to communicate a series of requests/response. The
exchange might look like this:
guest: REQ discover memory region
host: RSP memory region id: 4 size: 8k
guest: REQ map region id: 4 size: 8k sgl: {(addr=43000, size=4k),
(addr=944000, size=4k)}
host: RSP mapped region id: 4
guest: REQ notify region id: 4
host: RSP notify region id: 4
guest: REQ poll region id: 4
host: RSP poll region id: 4
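To make the exchange above concrete, the queue entries could be laid
out along these lines (none of this is an existing virtio ABI; all the
names are invented for the sketch):

/* Illustrative message layout for the REQ/RSP exchange above; invented
 * for the sketch, not an existing ABI. */
#include <stdint.h>

enum shm_msg_type {
    SHM_REQ_DISCOVER = 1,   /* guest: which regions exist?              */
    SHM_RSP_DISCOVER,       /* host: region id + size                   */
    SHM_REQ_MAP,            /* guest: map region over the sg list below */
    SHM_RSP_MAP,            /* host: mapping established                */
    SHM_REQ_NOTIFY,         /* guest: signal the peer(s)                */
    SHM_RSP_NOTIFY,
    SHM_REQ_POLL,           /* guest: wait for an incoming notification */
    SHM_RSP_POLL,
};

struct shm_sg_entry {
    uint64_t addr;          /* guest-physical address of one chunk */
    uint64_t len;           /* chunk length in bytes               */
};

struct shm_msg {
    uint32_t type;            /* enum shm_msg_type                   */
    uint32_t region_id;       /* e.g. 4 in the exchange above        */
    uint64_t size;            /* total region size, e.g. 8k          */
    uint32_t nr_sg;           /* number of entries in sg[]           */
    struct shm_sg_entry sg[]; /* scatter/gather list for SHM_REQ_MAP */
};

The 8k map request above would then be a SHM_REQ_MAP with nr_sg = 2 and
the two 4k chunks in sg[].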
That looks significantly more complex.
It's also supporting dynamic shared memory. If you do use BARs,
then perhaps you'd just do PCI hotplug to make things dynamic.
And the REQ/RSP order does not have to be in series like this. In
general, you need one entry on the queue to poll for new memory
regions, one entry for each mapped region to poll for incoming
notification, and then the remaining entries can be used to send
short-lived requests/responses.
It's important that the REQ map takes a scatter/gather list of
physical addresses because after running for a while, it's
unlikely that you'll be able to allocate any significant amount of
contiguous memory.
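In guest-driver terms that just means grabbing whatever pages are
available and describing them one by one, e.g. (Linux-flavoured sketch;
shm_sg_entry is the invented structure from above):

/* Guest-side sketch: build the scatter/gather list from individually
 * allocated pages instead of one large contiguous buffer. */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/types.h>

struct shm_sg_entry {
    u64 addr;
    u64 len;
};

/* Fill 'sg' with up to 'nr_pages' single-page entries; returns how many
 * pages were actually obtained. */
static int build_sg_list(struct shm_sg_entry *sg, int nr_pages)
{
    int i;

    for (i = 0; i < nr_pages; i++) {
        struct page *page = alloc_page(GFP_KERNEL);

        if (!page)
            break;                          /* take what we can get */
        sg[i].addr = (u64)page_to_pfn(page) << PAGE_SHIFT;
        sg[i].len  = PAGE_SIZE;
    }
    return i;
}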
From a QEMU perspective, you would do memory sharing by waiting
for a map REQ from the guest and then you would complete the
request by doing an mmap(MAP_FIXED) with the appropriate
parameters into phys_ram_base.
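Roughly, per sg entry (error handling omitted; phys_ram_base stands in
for QEMU's mapping of guest RAM, the rest of the names are
illustrative):

/* Host-side sketch: satisfy a map REQ by mmap()ing the shared file over
 * the host virtual addresses backing the guest-physical chunks in the
 * request.  MAP_FIXED replaces the existing RAM mapping in place. */
#include <stdint.h>
#include <sys/mman.h>

extern uint8_t *phys_ram_base;      /* QEMU's base of guest RAM */

static int map_shared_region(int shm_fd, const uint64_t *guest_addrs,
                             const uint64_t *lens, int nr_sg)
{
    uint64_t file_off = 0;
    int i;

    for (i = 0; i < nr_sg; i++) {
        void *target = phys_ram_base + guest_addrs[i];

        if (mmap(target, lens[i], PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, shm_fd, file_off) == MAP_FAILED)
            return -1;
        file_off += lens[i];
    }
    return 0;
}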
That will fragment the vma list. And what do you do when you unmap
the region?
How does a 256M guest map 1G of shared memory?
It doesn't, but it couldn't today either because of the 32-bit BARs.
Regards,
Anthony Liguori
-----------------------------------------------
A. Cameron Macdonell
Ph.D. Student
Department of Computing Science
University of Alberta
cam@xxxxxxxxxxxxxx