On Wed, Jan 20, 2021, Vitaly Kuznetsov wrote:
> Igor Mammedov <imammedo@xxxxxxxxxx> writes:
>
> > On Tue, 19 Jan 2021 09:20:42 -0800
> > Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> >> Were you planning on adding a capability to check for the new and improved
> >> memslots limit, e.g. to know whether or not KVM might die on a large VM?
> >> If so, requiring the VMM to call an ioctl() to set a higher (or lower?)
> >> limit would be another option.  That wouldn't have the same permission
> >> requirements as a module param, but it would likely be a more effective
> >> safeguard in practice, e.g. use cases with a fixed number of memslots or
> >> a well-defined upper bound could use the capability to limit themselves.
> >
> > Currently QEMU uses KVM_CAP_NR_MEMSLOTS to get the limit, and depending on
> > where the limit is hit it either fails gracefully (i.e. it checks whether
> > a free slot is available before slot allocation) or aborts (in the case
> > where it tries to allocate a slot without checking).
>
> FWIW, the 'synic problem' causes it to abort.
>
> > A new ioctl() seems redundant as we already have an upper limit check
> > (unless it would allow going over that limit, which in turn defeats the
> > purpose of the limit).

Gah, and I even looked for an ioctl().  No idea how I didn't find NR_MEMSLOTS.

> Right, I didn't plan to add any new CAP as what we already have should be
> enough to query the limits.  Having an ioctl to set the upper limit seems
> complicated: with device and CPU hotplug it may not be easy to guess what it
> should be upfront, so VMMs will likely just add a call to raise the limit in
> their memslot modification code, and it won't provide any new protection.

True.  Maybe the best approach, if we want to make the limit configurable,
would be to make a lower limit opt-in, i.e. default KVM_CAP_NR_MEMSLOTS to
SHRT_MAX, with an ioctl() to set a lower limit (rough sketch of the VMM side
at the end of this mail).  That would also allow us to defer adding a new
ioctl() until someone actually plans on using it.

> >> Thoughts?  An ioctl() feels a little over-engineered, but I suspect that
> >> adding a module param that defaults to N*KVM_MAX_VCPUS will be a waste,
> >> e.g. no one will ever touch the param and we'll end up with dead,
> >> rarely-tested code.
>
> Alternatively, we can hard-code KVM_USER_MEM_SLOTS to N*KVM_MAX_VCPUS so no
> new parameter is needed, but personally I'd prefer to have it configurable
> (in case we decide to set it to something lower than SHRT_MAX, of course)
> even if it won't be altered very often (which is a good thing for 'general
> purpose' usage, right?).
>
> First, it will allow tightening the limit for some very specific deployments
> (e.g. FaaS/Firecracker-style) to, say, '20', which should be enough.

A module param likely isn't usable for many such deployments though, as it
would require a completely isolated pool of systems.  That's why an ioctl()
is appealing; the expected number of memslots is a property of the VM, not of
the platform.

> Second, we may be overlooking some configurations where the number of
> memslots actually depends on the number of vCPUs, but nobody has complained
> so far just because these configurations use a fairly small number and the
> ceiling hasn't been hit yet.
>
> One more spare thought: per-vCPU memslots may come in handy if someone
> decides to move some of the KVM PV features to userspace, e.g. I can imagine
> an attempt to move async_pf out of the kernel.

Memslots aren't per-vCPU though; any userspace feature that tries to treat
them as such will be flawed in some way.
Practically speaking, per-vCPU memslots just aren't feasible, at least not
without KVM changes that are likely unpalatable, e.g. KVM would need to
incorporate the existence of a vCPU-specific memslot into the MMU role.
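
To make the VM-scoped nature of memslots concrete, here's a minimal userspace
sketch (not from any real VMM, error handling trimmed): both the capability
query and the slot registration go through fds with no vCPU association, and
struct kvm_userspace_memory_region has no vCPU field to begin with.

  #include <fcntl.h>
  #include <linux/kvm.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  int main(void)
  {
          int kvm = open("/dev/kvm", O_RDWR);
          int vm = ioctl(kvm, KVM_CREATE_VM, 0);

          /* QEMU-style query of the per-VM memslot ceiling. */
          int max_slots = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_NR_MEMSLOTS);
          printf("KVM_CAP_NR_MEMSLOTS = %d\n", max_slots);

          /* Back a 2MiB guest physical range with anonymous memory. */
          size_t size = 0x200000;
          void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

          /*
           * The slot is installed via the VM fd; no vCPU in sight.  A
           * graceful VMM checks max_slots before getting to this point.
           */
          struct kvm_userspace_memory_region region = {
                  .slot = 0,
                  .guest_phys_addr = 0,
                  .memory_size = size,
                  .userspace_addr = (unsigned long)mem,
          };
          return ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) ? 1 : 0;
  }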
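
And circling back to the opt-in lower limit: if that were ever implemented,
I'd expect it to ride on the existing KVM_ENABLE_CAP machinery rather than a
brand new ioctl().  To be clear, treating KVM_CAP_NR_MEMSLOTS as a settable
cap below is pure invention for illustration; no such interface exists today.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /*
   * Hypothetical: opt a VM into a lower memslot ceiling.  KVM_ENABLE_CAP
   * is real, but KVM_CAP_NR_MEMSLOTS is not a settable cap today -- this
   * only sketches what the VMM side could look like.  A Firecracker-style
   * VMM with a fixed device model might call limit_memslots(vm_fd, 20).
   */
  static int limit_memslots(int vm_fd, unsigned int max)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_NR_MEMSLOTS,     /* hypothetical */
                  .args[0] = max,                 /* new, lower ceiling */
          };
          /* Default stays SHRT_MAX; VMs that don't opt in lose nothing. */
          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }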