Hi Maxim,

On Tue Nov 28, 2023 at 6:56 AM UTC, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva <anelkz@xxxxxxxxx>
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS, a capability that segments the VM's
> > APIC ids into two parts. The lower bits, the physical APIC id, represent
> > the part that's exposed to the guest. The higher bits, which are private
> > to KVM, group APICs together. APICs in different groups are isolated from
> > each other, and IPIs can only be directed at APICs that share the same
> > group as their source. Furthermore, groups are only relevant to IPIs;
> > anything incoming from outside the local APIC complex (the IOAPIC, MSIs,
> > or PV-IPIs) is targeted at the default APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source
> > vCPU's APIC group with the ICR's destination ID and use that to resolve
> > the target lAPIC. The APIC physical map is also made group-aware in
> > order to speed up this process. For the sake of simplicity, the logical
> > map is not built while KVM_CAP_APIC_ID_GROUPS is in use, and we defer IPI
> > routing to the slower per-vCPU scan method.
> >
> > This capability serves as a building block to implement
> > virtualisation-based security features like Hyper-V's Virtual Secure
> > Mode (VSM). VSM introduces a para-virtualised switch that allows guest
> > CPUs to jump into a different execution context; this switches into a
> > different CPU state, lAPIC state, and set of memory protections. We
> > model this in KVM by using distinct kvm_vcpus for each context.
> > Moreover, execution contexts are hierarchical and their APICs are meant
> > to remain functional even when the context isn't 'scheduled in'. For
> > example, we have to keep track of timers' expirations, and interrupt
> > execution of lesser-priority contexts when relevant. Hence the need to
> > alias physical APIC ids, while keeping the ability to target specific
> > execution contexts.
>
> A few general remarks on this patch (assuming that we don't go with the
> approach of a VM per VTL, in which case this patch is not needed):
>
> -> This feature has to be done in the kernel, because vCPUs sharing the
>    same VTL will have the same APIC ID. (In addition to that, APIC state
>    is private to a VTL, so each VTL can even change its APIC id.)
>
>    Because of this, KVM has to have at least some awareness of it.
>
> -> APICv/AVIC should eventually be supported with VTLs: this is thankfully
>    possible by having separate physid/pid tables per VTL, and will mostly
>    just work, but it needs KVM awareness.
>
> -> I am somewhat against reserving bits in the APIC id, because that will
>    limit the number of APIC id bits available to userspace. Currently this
>    is not a problem, but it might become one in the future if for some
>    reason userspace wants an APIC id with high bits set.
>
>    Still, things change, and with this being part of KVM's ABI, it might
>    backfire. A better idea IMHO is to have 'APIC namespaces', much like
>    PID namespaces: each namespace is IPI-isolated on its own, and each
>    vCPU belongs to exactly one namespace.
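
(To make sure I'm reading both schemes the same way you are, here's a
stand-alone toy model of the two encodings being discussed: the patch's
KVM-private group bits packed into the APIC id, and an explicit per-vCPU
namespace id. All names and field widths below are made up for
illustration; none of this is code from the series.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy model only. The patch packs a KVM-private group above the
 * guest-visible physical APIC id; 8 group bits over 24 id bits is an
 * arbitrary split chosen for the example.
 */
#define APIC_GROUP_SHIFT	24u
#define APIC_PHYS_ID_MASK	((1u << APIC_GROUP_SHIFT) - 1)

struct toy_vcpu {
	uint32_t apic_id;	/* group bits | guest-visible physical id */
	uint32_t apic_ns;	/* alternative: explicit namespace id */
};

/*
 * Physical-destination routing as the commit message describes it: OR
 * the sender's group into the ICR destination before the map lookup.
 */
static uint32_t resolve_dest_group(const struct toy_vcpu *src, uint32_t icr_dest)
{
	uint32_t group = src->apic_id & ~APIC_PHYS_ID_MASK;

	return group | (icr_dest & APIC_PHYS_ID_MASK);
}

/*
 * Namespace variant: the APIC id stays fully guest-owned and delivery
 * just filters on a separate per-vCPU field.
 */
static int same_namespace(const struct toy_vcpu *a, const struct toy_vcpu *b)
{
	return a->apic_ns == b->apic_ns;
}

int main(void)
{
	/* Two vCPUs with the same guest-visible APIC id, one per context. */
	struct toy_vcpu vtl1 = { .apic_id = (1u << APIC_GROUP_SHIFT) | 2u, .apic_ns = 1 };
	struct toy_vcpu vtl0 = { .apic_id = 2u, .apic_ns = 0 };

	/* A group-1 vCPU IPIs physical id 3: the lookup stays in group 1. */
	printf("dest = 0x%08" PRIx32 "\n", resolve_dest_group(&vtl1, 3u));
	/* Same physical id, different namespaces: they never see each other. */
	printf("same ns = %d\n", same_namespace(&vtl1, &vtl0));
	return 0;
}

With the namespace variant, the full 32 bits of x2APIC id stay available
to userspace, which I take to be the crux of your objection to reserving
bits in the id.
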
> In fact, Intel's PRM has a brief mention of a 'hierarchical cluster' mode,
> which roughly describes this situation: there are multiple APIC buses that
> are not interconnected, and communication between them needs a 'cluster
> manager device'.
>
> However, I don't think that we need an explicit pairing of vCPUs and VTL
> awareness in the kernel; all of this, I think, can be done in userspace.
>
> TL;DR: Let's have APIC namespaces. A vCPU can belong to a single namespace,
> and all vCPUs in a namespace send IPIs to each other and know nothing about
> vCPUs from other namespaces.
>
> A vCPU sending an IPI to a different VTL can thankfully only do so through
> a hypercall, and thus it can be handled in userspace.
>
> Overall though, IMHO the approach of a VM per VTL is better, unless some
> show-stoppers show up. If we go with a VM per VTL, we gain APIC namespaces
> for free, together with AVIC support and such.

Thanks for the thorough review! I took note of all your design comments (here
and in subsequent patches). I agree that the way to go is the VM-per-VTL
approach. I'll prepare a PoC as soon as I'm back from the holidays and share
my results.

Nicolas