Kevin,

On Fri, Dec 10 2021 at 07:29, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>   4) For the guest side we agreed that we need an hypercall because the
>>      host can't trap the write to the MSI[-X] entry anymore.
>
> To be accurate I'd like to not call it "can't trap". The host still traps the
> MSI/MSI-X entry if the hypercall is not used. This is for current guest
> OS which doesn't have this hypercall mechanism. For future new guest
> OS which will support this machinery then a handshake process from
> such guest will disable the trap for MSI-X and map it for direct guest
> access in the fly.

Right. What I'm suggesting is a clear cut between the current approach,
which obviously needs to be preserved, and a consistent new approach
which handles MSI, MSI-X and IMS in exactly the same way.

> MSI has to be always trapped although the guest has acquired the right
> data/addr pair via the hypercall, since it's a PCI config capability.
>
>>
>>      Aside of the fact that this creates a special case for IMS which is
>>      undesirable in my opinion, it's not really obvious where the
>>      hypercall should be placed to work for all scenarios so that it can
>>      also solve the existing issue of silent failures.
>>
>>   5) It's not possible for the kernel to reliably detect whether it is
>>      running on bare metal or not. Yes we talked about heuristics, but
>>      that's something I really want to avoid.
>
> How would the hypercall mechanism avoid such heuristics?

Through the availability of IR, where the irqdomain provided by the
remapping unit signals that it supports this new scheme:

                   |--IO/APIC
                   |--MSI
    vector -- IR --|--MSI-X
                   |--IMS

while the current scheme is:

                   |--IO/APIC
    vector -- IR --|--PCI/MSI[-X]

or

                   |--IO/APIC
    vector --------|--PCI/MSI[-X]

So in the new scheme the IR domain will advertise new features which are
not available on older kernels.
The availability of these new features is the indicator for the
interrupt subsystem and subsequently for PCI whether IMS is supported
or not.

Bootup either finds an IR unit or not. In the bare metal case that's the
usual hardware/firmware detection. In the guest case it's the
availability of vIR including the required hypercall protocol.

So for the guest case the initialization would look like this:

   qemu starts guest
      Method of interrupt management defaults to current scheme
      restricted to MSI/MSI-X

      guest initializes
          older guests do not check for the hypercall so everything
          continues as of today

          new guest initializes vIR, detects hypercall and requests
          from the hypervisor to switch over to the new scheme.

          The hypervisor grants or rejects the request, i.e. either both
          switch to the new scheme or stay with the old one.

The new scheme means that all variants, MSI, MSI-X, IMS are handled in
the same way. Staying on the old scheme means that IMS is not available
to the guest.

Having that clear separation is in my opinion way better than trying to
make all of that a big maze of conditionals.

I'm going to make that clear cut in the PCI/MSI management layer as
well. Trying to do the hybrid we do today for irqdomain and
non-irqdomain based backends is just not feasible. The decision which
way to go will be very close to the driver exposed API, i.e.:

   pci_alloc_ims_vector()
        if (new_scheme())
                return new_scheme_alloc_ims();
        else
                return -ENOTSUPP;

and new_scheme_alloc_ims() will have:

   new_scheme_alloc_ims()
        if (!system_supports_ims())
                return -ENOTSUPP;
        ....

system_supports_ims() makes its decision based on the feature flags
exposed by the underlying base MSI irqdomain, i.e. either vector or IR
on x86.

Vector domain will not have that feature flag set, but IR will have on
bare metal and eventually in the guest when the vIR setup and hypercall
detection and negotiation succeeds.

> Then Qemu needs to find out the GSI number for the vIRTE handle.
> Again Qemu doesn't have such information since it doesn't know
> which MSI[-X] entry points to this handle due to no trap.
>
> This implies that we may also need carry device ID, #msi entry, etc.
> in the hypercall, so Qemu can associate the virtual routing info
> to the right [irqfd, gsi].
>
> In your model the hypercall is raised by IR domain. Do you see
> any problem of finding those information within IR domain?

IR has the following information available:

   Interrupt type
    - MSI:   Device, index and number of vectors
    - MSI-X: Device, index
    - IMS:   Device, index

   Target APIC/vector pair

IMS: The index depends on the storage type:

     For storage in device memory, e.g. IDXD, it's the array index.

     For storage in system memory, the index is a software artifact.

Does that answer your question?

Thanks,

        tglx
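
As a rough illustration of the information listed above, the per-interrupt
data the IR domain has at hand could be modelled like this. It is only a
sketch: every identifier below is hypothetical and none of it is existing
kernel or hypervisor API. Representing "Device" as a plain integer ID is
likewise an assumption made for the example.

```c
/* Illustrative sketch only: all names here are hypothetical. */
#include <assert.h>
#include <stdint.h>

enum sketch_irq_type {
	SKETCH_IRQ_MSI,
	SKETCH_IRQ_MSIX,
	SKETCH_IRQ_IMS,
};

/* What the IR domain knows about one interrupt it sets up. */
struct sketch_ir_info {
	enum sketch_irq_type type;
	uint32_t device_id;	/* assumption: some integer identifying the device */
	uint32_t index;		/* MSI/MSI-X: entry index
				 * IMS: array index (device memory storage) or
				 *      a software artifact (system memory storage) */
	uint32_t nvec;		/* number of vectors, only meaningful for MSI */
	uint32_t apic_id;	/* target APIC */
	uint8_t vector;		/* target vector */
};

/*
 * Build the description for one interrupt. Only MSI carries a vector
 * count in the list above, so MSI-X and IMS are normalized to one
 * vector per index here.
 */
static struct sketch_ir_info
sketch_ir_fill(enum sketch_irq_type type, uint32_t device_id, uint32_t index,
	       uint32_t nvec, uint32_t apic_id, uint8_t vector)
{
	if (type != SKETCH_IRQ_MSI)
		nvec = 1;
	assert(nvec >= 1);

	return (struct sketch_ir_info) {
		.type = type,
		.device_id = device_id,
		.index = index,
		.nvec = nvec,
		.apic_id = apic_id,
		.vector = vector,
	};
}
```

A hypercall raised from the IR domain could carry such a structure, which
would give the hypervisor side the device and index it needs to find the
matching routing entry without trapping the MSI-X or IMS storage.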