On 2023/12/5 17:19, Roger Pau Monné wrote: > On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote: >> On Mon, 4 Dec 2023, Roger Pau Monné wrote: >>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote: >>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote: >>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote: >>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote: >>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote: >>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote: >>>>>>>>> This patch is to solve two problems we encountered when we try to >>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0. >>>>>>>>> >>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device >>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0, >>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call >>>>>>>>> into Xen and check whether dom0 has the mapping. See >>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH >>>>>>>>> dom0 and it return irq is 0, and then return -EPERM. >>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq >>>>>>>>> when thay are enabled. >>>>>>>>> >>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get >>>>>>>>> registered, but gsi must be configured for it to be able to be >>>>>>>>> mapped into a domU. >>>>>>>>> >>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be >>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when >>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems >>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be >>>>>>>>> unmasked. >>>>>>>>> >>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we >>>>>>>>> assign a device to be passthrough. So that the gsi can get registered >>>>>>>>> and mapped in PVH dom0. >>>>>>>> >>>>>>>> >>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do >>>>>>>> we need the unmask check in Xen? Couldn't we just do: >>>>>>>> >>>>>>>> >>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c >>>>>>>> index 4e40d3609a..df262a4a18 100644 >>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c >>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c >>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent( >>>>>>>> hvm_dpci_eoi(d, gsi); >>>>>>>> } >>>>>>>> >>>>>>>> - if ( is_hardware_domain(d) && unmasked ) >>>>>>>> + if ( is_hardware_domain(d) ) >>>>>>>> { >>>>>>>> /* >>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock >>>>>>> >>>>>>> There are some issues with this approach. >>>>>>> >>>>>>> mp_register_gsi() will only setup the trigger and polarity of the >>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order >>>>>>> to assert that the configuration is the intended one. A guest is >>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but >>>>>>> that doesn't take effect unless the pin is unmasked. >>>>>>> >>>>>>> Overall the question would be whether we have any guarantees that >>>>>>> the hardware domain has properly configured the pin, even if it's not >>>>>>> using it itself (as it hasn't been unmasked). >>>>>>> >>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we >>>>>>> could configure any pins that are not setup at bind time? >>>>>> >>>>>> That could work. >>>>>> >>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at >>>>>> bind time? That might be enough to pass a pirq_access_permitted check. >>>>> >>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq >>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the >>>>> parameter would be a GSI instead of a previously mapped IRQ). Such >>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that >>>>> route I would recommend that we instead introduce a new dmop that has >>>>> this syntax regardless of the domain type it's called from. >>>> >>>> Looking at the code it is certainly a bit confusing. My point was that >>>> we don't need to wait until polarity and trigger are set appropriately >>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen >>>> should be able to figure out that Dom0 is permitted pirq access. >>> >>> The logic is certainly not straightforward, and it could benefit from >>> some comments. >>> >>> The irq permissions are a bit special, in that they get setup when the >>> IRQ is mapped. >>> >>> The problem however is not so much with IRQ permissions, that we can >>> indeed sort out internally in Xen. Such check in dom0 has the side >>> effect of preventing the IRQ from being assigned to a domU without the >>> hardware source being properly configured AFAICT. >> >> Now I understand why you made a comment previously about Xen having to >> configure trigger and polarity for these interrupts on its own. >> >> >>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier >>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or >>>> polarity to be configured to work. But the suggestion of doing it a >>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea. >>>> >>>> But maybe we can find another location, maybe within >>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before >>>> trigger and polarity are set and before the interrupt is unmasked. >>>> >>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the >>>> call to allocate_and_map_gsi_pirq, because by the time >>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had >>>> already been done. >>> >>> But then we would end up in a situation where the >>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be >>> configured, which I think it's not what we want. >>> >>> One option would be to allow mp_register_gsi() to be called multiple >>> times, and update the IO-APIC pin configuration as long as the pin is >>> not unmasked. That would propagate each dom0 RTE update to the >>> underlying IO-APIC. However such approach relies on dom0 configuring >>> all possible IO-APIC pins, even if no device on dom0 is using them, I >>> think it's not a very reliable option. >>> >>> Another option would be to modify the toolstack to setup the GSI >>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous >>> email, since we only care about PCI device passthrough the legacy INTx >>> should always be level triggered and low polarity. >>> >>>> I am not familiar with vioapic.c but to give you an idea of what I was >>>> thinking: >>>> >>>> >>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c >>>> index 4e40d3609a..16d56fe851 100644 >>>> --- a/xen/arch/x86/hvm/vioapic.c >>>> +++ b/xen/arch/x86/hvm/vioapic.c >>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig, >>>> return ret; >>>> } >>>> >>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq); >>>> - if ( ret ) >>>> - { >>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n", >>>> - gsi, ret); >>>> - return ret; >>>> - } >>>> - >>>> pcidevs_lock(); >>>> ret = pt_irq_create_bind(currd, &pt_irq_bind); >>>> if ( ret ) >>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent( >>>> hvm_dpci_eoi(d, gsi); >>>> } >>>> >>>> + if ( is_hardware_domain(d) ) >>>> + { >>>> + int pirq = gsi, ret; >>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq); >>>> + if ( ret ) >>>> + { >>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n", >>>> + gsi, ret); >>>> + return ret; >>>> + } >>>> + } >>>> if ( is_hardware_domain(d) && unmasked ) >>>> { >>>> /* >>> >>> As said above, such approach relies on dom0 writing to the IO-APIC RTE >>> of likely each IO-APIC pin, which is IMO not quite reliable. In there >>> are two different issues here that need to be fixed for PVH dom0: >>> >>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to >>> succeed for a PVH dom0, even if dom0 is not using the GSI itself. >> >> Yes makes sense >> >> >>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using >>> the IO-APIC pin itself. >>> >>> First one needs to be fixed internally in Xen, second one will require >>> the toolstack to issue an extra hypercall in order to ensure the >>> IO-APIC pin is properly configured. >> >> On ARM, Xen doesn't need to wait for dom0 to configure interrupts >> correctly. Xen configures them all on its own at boot based on Device >> Tree information. I guess it is not possible to do the same on x86? > > No, not exactly. There's some interrupt information in the ACPI MADT, > but that's just for very specific sources (Interrupt Source Override > Structures) > > Then on AML devices can have resource descriptors that contain > information about how interrupts are setup. However Xen is not able > to read any of this information on AML. > > Legacy PCI interrupts are (always?) level triggered and low polarity, > because it's assumed that an interrupt source can be shared between > multiple devices. > > I'm however not able to find any reference to this in the PCI spec, > hence I'm reluctant to take this for granted in Xen, and default all > GSIs >= 16 to such mode. > > OTOH legacy PCI interrupts are not that used anymore, as almost all > devices will support MSI(-X) (because PCIe mandates it) and OSes > should prefer the latter. SR-IOV VF don't even support legacy PCI > interrupts anymore. > >> If >> not, then I can see why we would need 1 extra toolstack hypercall for >> that (or to bundle the operation of configuring IO-APIC pins together >> with an existing toolstack hypercall). > > One suitable compromise would be to default unconfigured GSIs >= 16 to > level-triggered and low-polarity, as I would expect that to work in > almost all cases. We can always introduce the usage of > PHYSDEVOP_setup_gsi later if required. > > Maybe Jan has more input here, would you agree to defaulting non-ISA > GSIs to level-triggered, low-polarity in the absence of a specific > setup provided by dom0? > > Thanks, Roger. No intention to disturb if I am incorrect, just a little input. On dom0 PVH, when it enables devices, it will call acpi_pci_irq_enable, and in that function, its default trigger is level and polarity is low for pci interrupt. -- Best regards, Jiqian Chen.