Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS

Hi Maxim,

On Tue Nov 28, 2023 at 6:56 AM UTC, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva <anelkz@xxxxxxxxx>
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS, a capability that splits the VM's
> > APIC ids into two fields. The lower bits, the physical APIC id, are
> > the part exposed to the guest. The higher bits, which are private to
> > KVM, group APICs together. APICs in different groups are isolated
> > from each other, and IPIs can only be directed at APICs that share
> > the same group as their source. Furthermore, groups are only
> > relevant to IPIs; anything arriving from outside the local APIC
> > complex (the IOAPIC, MSIs, or PV-IPIs) is targeted at the default
> > APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source
> > vCPU's APIC group with the ICR's destination ID and use the result
> > to resolve the target lAPIC. The APIC physical map is also made
> > group-aware in order to speed up this process. For the sake of
> > simplicity, the logical map is not built while
> > KVM_CAP_APIC_ID_GROUPS is in use, and IPI routing falls back to the
> > slower per-vCPU scan method.
> >
> > This capability serves as a building block to implement
> > virtualisation-based security features like Hyper-V's Virtual Secure
> > Mode (VSM). VSM introduces a para-virtualised switch that allows
> > guest CPUs to jump into a different execution context; the switch
> > brings in a different CPU state, lAPIC state, and set of memory
> > protections. We model this in KVM by using distinct kvm_vcpus for
> > each context. Moreover, execution contexts are hierarchical, and
> > their APICs are meant to remain functional even when the context
> > isn't 'scheduled in'. For example, we have to keep track of timers'
> > expirations, and interrupt the execution of lower-priority contexts
> > when relevant. Hence the need to alias physical APIC ids while
> > keeping the ability to target specific execution contexts.
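
A minimal sketch, for readers skimming the thread, of the ID split and
the OR-based routing described above. The 8-bit group width and all of
the names here are my own illustration, not the patch's actual
definitions:

    #include <stdint.h>

    /*
     * Illustrative only: assume the top 8 of 32 APIC ID bits are
     * reserved for the group; the real capability lets userspace
     * choose the split.
     */
    #define APIC_GROUP_SHIFT  24
    #define APIC_GROUP_MASK   (0xffu << APIC_GROUP_SHIFT)

    /* Group bits are private to KVM; the guest only sees the low bits. */
    static inline uint32_t apic_group(uint32_t apic_id)
    {
            return (apic_id & APIC_GROUP_MASK) >> APIC_GROUP_SHIFT;
    }

    /*
     * Physical-destination IPI: OR the sender's group bits into the
     * ICR destination, so an IPI can only land within the sender's
     * group.  External sources (IOAPIC, MSIs, PV-IPIs) carry no group
     * bits and thus hit group 0, the default.
     */
    static inline uint32_t resolve_phys_dest(uint32_t src_apic_id,
                                             uint32_t icr_dest)
    {
            return (src_apic_id & APIC_GROUP_MASK) | icr_dest;
    }
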
>
>
> A few general remarks on this patch (assuming we don't go with the
> VM-per-VTL approach, in which case this patch isn't needed):
>
> -> This feature has to be done in the kernel, because vCPUs sharing
>    the same VTL will have the same APIC ID.
>    (In addition, APIC state is private to a VTL, so each VTL can even
>    change its APIC ID.)
>
>    Because of this, KVM has to have at least some awareness of VTLs.
>
> -> APICv/AVIC should eventually be supported with VTLs:
>    This is thankfully possible by having separate physid/pid tables
>    per VTL, and will mostly just work, but it needs KVM awareness.
>
> -> I am somewhat against reserving bits in the APIC ID, because that
>    limits the number of APIC ID bits available to userspace. Currently
>    this is not a problem, but it might become one if userspace ever
>    wants an APIC ID with high bits set.
>
>    Still, things change, and with this being part of KVM's ABI, it
>    might backfire. A better idea IMHO is to have 'APIC namespaces',
>    much like, say, PID namespaces: each namespace is isolated
>    IPI-wise, and each vCPU belongs to exactly one namespace.
>
>    In fact, Intel's PRM briefly mentions a 'hierarchical cluster'
>    mode, which roughly describes this situation: multiple APIC buses
>    that are not interconnected, with communication between them going
>    through a 'cluster manager device'.
>
>    However, I don't think we need explicit vCPU pairs or VTL awareness
>    in the kernel; all of this, I think, can be done in userspace.
>
>    TL;DR: Let's have APIC namespaces. A vCPU belongs to a single
>    namespace, and all vCPUs in a namespace can send IPIs to each other
>    while knowing nothing about vCPUs from other namespaces.
>
>    A vCPU can thankfully only send an IPI to a different VTL via a
>    hypercall, so that case can be handled in userspace.
>
>
> Overall though, IMHO the VM-per-VTL approach is better, unless some
> show-stoppers turn up. If we go with a VM per VTL, we get APIC
> namespaces for free, together with AVIC support and such.


Thanks for the thorough review! I took note of all your design comments
(here and in the subsequent patches).
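
One practical note while it's fresh: enabling the capability from
userspace goes through the usual KVM_ENABLE_CAP ioctl on the VM fd.
The sketch below assumes args[0] carries the number of APIC ID bits
reserved for the group; that layout is my guess, so the RFC's headers
are the authority for KVM_CAP_APIC_ID_GROUPS and the real ABI:

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /*
     * KVM_CAP_APIC_ID_GROUPS comes from the RFC's headers; it is not
     * in upstream linux/kvm.h.
     */
    static int enable_apic_id_groups(int vm_fd, unsigned long group_bits)
    {
            struct kvm_enable_cap cap = {
                    .cap = KVM_CAP_APIC_ID_GROUPS,
                    /* Guessed ABI: args[0] = group bit count. */
                    .args = { group_bits },
            };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }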

I agree that the VM-per-VTL approach is the way to go. I'll prepare a
PoC as soon as I'm back from the holidays and share my results.
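
For the PoC, the rough userspace layout I have in mind is one KVM VM
per VTL, with a vCPU per (CPU, VTL) pair. NR_VTLS, NR_CPUS, and the
error handling below are simplified for illustration:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define NR_VTLS 2       /* e.g. VTL0 and VTL1 */
    #define NR_CPUS 4

    /*
     * One VM per VTL: each VTL keeps its own private lAPICs, and the
     * same vCPU ids repeat across VTLs, so IPIs are naturally scoped
     * to a VTL (the 'APIC namespaces' you describe, for free).
     */
    static int create_vtl_vms(int vm_fds[NR_VTLS])
    {
            int kvm_fd = open("/dev/kvm", O_RDWR);

            if (kvm_fd < 0)
                    return -1;

            for (int vtl = 0; vtl < NR_VTLS; vtl++) {
                    vm_fds[vtl] = ioctl(kvm_fd, KVM_CREATE_VM, 0);
                    if (vm_fds[vtl] < 0)
                            return -1;

                    /* A real VMM would keep the vcpu fds around. */
                    for (int cpu = 0; cpu < NR_CPUS; cpu++)
                            if (ioctl(vm_fds[vtl], KVM_CREATE_VCPU, cpu) < 0)
                                    return -1;
            }

            return 0;
    }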

Nicolas




