On Wed, 2012-12-05 at 19:32 -0200, Marcelo Tosatti wrote:
> On Mon, Dec 03, 2012 at 04:39:05PM -0700, Alex Williamson wrote:
> > Memory slots are currently a fixed resource with a relatively small
> > limit.  When using PCI device assignment in a qemu guest it's fairly
> > easy to exhaust the number of available slots.  I posted patches
> > exploring growing the number of memory slots a while ago, but it was
> > prior to caching memory slot array misses and therefore had
> > potentially poor performance.  Now that we do that, Avi seemed
> > receptive to increasing the memory slot array to arbitrary lengths.
> > I think we still don't want to impose unnecessary kernel memory
> > consumption on guests not making use of this, so I present again a
> > growable memory slot array.
> >
> > A couple of notes/questions; in the previous version we had a
> > kvm_arch_flush_shadow() call when we increased the number of slots.
> > I'm not sure if this is still necessary.  I had also made the
> > x86-specific slot_bitmap dynamically grow as well and switch between
> > a direct bitmap and an indirect pointer to a bitmap.  That may have
> > contributed to needing the flush.
>
> I don't remember. Do you recall what was the argument back then?
> (there must have been some).

I vaguely recall chatting with you on irc about it before posting, so
unfortunately there's no list discussion.  It's been almost 2 years, so
it's not surprising we've all forgotten.  Here's the original post:

http://article.gmane.org/gmane.linux.kernel/1103962

(click on the subject to get to the thread)  That version also included
an optimization to the x86-only slot_bitmap, and it's entirely possible
the flush had more to do with that than with the memslots themselves.
I think Avi kind of alludes to this in his first reply, where he notes
that the flushing is more aggressive than necessary and indicates it
could happen only when crossing BITS_PER_LONG boundaries.

> > I haven't done that yet here because it seems like an unnecessary
> > complication if we have a max on the order of 512 or 1024 entries.
> > A bit per slot isn't a lot of overhead.  If we want to go more,
> > maybe we should make it switch.  That leads to the final question:
> > since this does allow consumption of extra kernel memory, we need an
> > upper bound, and what should it be?  A PCI bus filled with assigned
> > devices can theoretically use up to 2048 slots (32 devices * 8
> > functions * (6 BARs + ROM + possibly split MSI-X BAR)).  For this
> > RFC, I don't change the max, just make it grow up to 32 user slots.
> > Untested on anything but x86 so far.  Thanks,
>
> Not sure. Some reasonable number based on current usage expectations?
> (can be increased later if necessary).

The first obvious step is to double it to 64 slots.  With typical
devices, that would give us 16+ assigned devices.  There are already
people bumping into the 8 device limit we set in RHEL, so doubling it
doesn't feel like much headroom.  If we double again to 128 slots then
we can likely support 32 typical devices.  That's a full PCI bus of
single function devices.  That's probably the first acceptable step.
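(For illustration only, a minimal userspace sketch of the grow-on-demand
idea being discussed: the slot array is doubled as needed rather than
allocated at its maximum up front, with a hard cap so the extra memory
stays bounded.  grow_memslots(), MEMSLOTS_CAP and struct memslot are
invented for this example and are not from the actual patch.)

/*
 * Sketch only, not the actual KVM patch: shows the double-on-demand
 * idea with a hard cap, in plain userspace C.  grow_memslots(),
 * MEMSLOTS_CAP and struct memslot are invented for this example.
 */
#include <stdlib.h>
#include <string.h>

#define MEMSLOTS_CAP 128		/* hypothetical upper bound */

struct memslot {
	unsigned long base_gfn;
	unsigned long npages;
	/* ... */
};

struct memslots {
	int nslots;			/* entries currently allocated */
	struct memslot *slots;		/* grown on demand */
};

/* Make sure slot index 'id' fits, doubling the array up to the cap. */
static int grow_memslots(struct memslots *ms, int id)
{
	int n = ms->nslots ? ms->nslots : 8;
	struct memslot *new;

	if (id < ms->nslots)
		return 0;		/* already big enough */

	while (n <= id && n < MEMSLOTS_CAP)
		n *= 2;
	if (id >= n)
		return -1;		/* would exceed the cap */

	new = calloc(n, sizeof(*new));
	if (!new)
		return -1;
	if (ms->slots) {
		memcpy(new, ms->slots, ms->nslots * sizeof(*new));
		free(ms->slots);
	}
	ms->slots = new;
	ms->nslots = n;
	return 0;
}

int main(void)
{
	struct memslots ms = { 0, NULL };

	return grow_memslots(&ms, 63);	/* grows to 64 entries, returns 0 */
}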
It looks like each slot on x86_64 is 64 bytes (somehow I was throwing
around 72 bytes before, not sure where I counted wrong), so we
currently have:

  32 user + 4 private slots = 36 * 64 = 2304
  32 + 4 id_to_index        = 36 * 4  =  144
  32 + 4 entry slot_bitmap  =             8
  Total                     =          2456

At 132 (128 + 4) slots, this becomes 8448 + 528 + 24 = 9000 bytes.

We can actually compact struct kvm_memory_slot down to 56 bytes
(flags -> u32, user_alloc -> bool, id -> short), which also cuts
id_to_index in half, so that gives us:

  7392 + 264 + 24 = 7680

(I might sacrifice a couple of user slots just to make these powers of
2, ie. 124 user + 4 private = 128, 7440 bytes)

Should we target that as a first step and ignore all this extra
complication?  Thanks,

Alex
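(As a sanity check, the totals above can be reproduced with a few lines
of userspace C; the 64-byte and 56-byte slot sizes and the 4- vs 2-byte
id_to_index entries are the figures from this mail, not recomputed from
the real struct kvm_memory_slot definition.)

/* Back-of-the-envelope check of the totals quoted above; all sizes are
 * the figures from this mail, not derived from the kernel headers. */
#include <stdio.h>

int main(void)
{
	int slots = 128 + 4;	/* 128 user + 4 private */

	/* current layout: 64-byte slots, 4-byte id_to_index, 24-byte bitmap */
	printf("132 slots, current:   %d\n", slots * 64 + slots * 4 + 24);	/* 9000 */

	/* compacted: flags -> u32, user_alloc -> bool, id -> short */
	printf("132 slots, compacted: %d\n", slots * 56 + slots * 2 + 24);	/* 7680 */

	/* trade a few user slots to land on a power of 2 */
	slots = 124 + 4;
	printf("128 slots, compacted: %d\n", slots * 56 + slots * 2 + 16);	/* 7440 */

	return 0;
}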