On 2023/12/5 18:32, Jan Beulich wrote: > On 05.12.2023 10:19, Roger Pau Monné wrote: >> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote: >>> On Mon, 4 Dec 2023, Roger Pau Monné wrote: >>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote: >>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote: >>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote: >>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote: >>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote: >>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote: >>>>>>>>>> This patch is to solve two problems we encountered when we try to >>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0. >>>>>>>>>> >>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device >>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0, >>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call >>>>>>>>>> into Xen and check whether dom0 has the mapping. See >>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH >>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM. >>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq >>>>>>>>>> when thay are enabled. >>>>>>>>>> >>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get >>>>>>>>>> registered, but gsi must be configured for it to be able to be >>>>>>>>>> mapped into a domU. >>>>>>>>>> >>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be >>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when >>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems >>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be >>>>>>>>>> unmasked. >>>>>>>>>> >>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we >>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered >>>>>>>>>> and mapped in PVH dom0. >>>>>>>>> >>>>>>>>> >>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do >>>>>>>>> we need the unmask check in Xen? Couldn't we just do: >>>>>>>>> >>>>>>>>> >>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c >>>>>>>>> index 4e40d3609a..df262a4a18 100644 >>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c >>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c >>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent( >>>>>>>>> hvm_dpci_eoi(d, gsi); >>>>>>>>> } >>>>>>>>> >>>>>>>>> - if ( is_hardware_domain(d) && unmasked ) >>>>>>>>> + if ( is_hardware_domain(d) ) >>>>>>>>> { >>>>>>>>> /* >>>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock >>>>>>>> >>>>>>>> There are some issues with this approach. >>>>>>>> >>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the >>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order >>>>>>>> to assert that the configuration is the intended one. A guest is >>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but >>>>>>>> that doesn't take effect unless the pin is unmasked. >>>>>>>> >>>>>>>> Overall the question would be whether we have any guarantees that >>>>>>>> the hardware domain has properly configured the pin, even if it's not >>>>>>>> using it itself (as it hasn't been unmasked). >>>>>>>> >>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we >>>>>>>> could configure any pins that are not setup at bind time? >>>>>>> >>>>>>> That could work. >>>>>>> >>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at >>>>>>> bind time? That might be enough to pass a pirq_access_permitted check. >>>>>> >>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq >>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the >>>>>> parameter would be a GSI instead of a previously mapped IRQ). Such >>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that >>>>>> route I would recommend that we instead introduce a new dmop that has >>>>>> this syntax regardless of the domain type it's called from. >>>>> >>>>> Looking at the code it is certainly a bit confusing. My point was that >>>>> we don't need to wait until polarity and trigger are set appropriately >>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen >>>>> should be able to figure out that Dom0 is permitted pirq access. >>>> >>>> The logic is certainly not straightforward, and it could benefit from >>>> some comments. >>>> >>>> The irq permissions are a bit special, in that they get setup when the >>>> IRQ is mapped. >>>> >>>> The problem however is not so much with IRQ permissions, that we can >>>> indeed sort out internally in Xen. Such check in dom0 has the side >>>> effect of preventing the IRQ from being assigned to a domU without the >>>> hardware source being properly configured AFAICT. >>> >>> Now I understand why you made a comment previously about Xen having to >>> configure trigger and polarity for these interrupts on its own. >>> >>> >>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier >>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or >>>>> polarity to be configured to work. But the suggestion of doing it a >>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea. >>>>> >>>>> But maybe we can find another location, maybe within >>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before >>>>> trigger and polarity are set and before the interrupt is unmasked. >>>>> >>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the >>>>> call to allocate_and_map_gsi_pirq, because by the time >>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had >>>>> already been done. >>>> >>>> But then we would end up in a situation where the >>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be >>>> configured, which I think it's not what we want. >>>> >>>> One option would be to allow mp_register_gsi() to be called multiple >>>> times, and update the IO-APIC pin configuration as long as the pin is >>>> not unmasked. That would propagate each dom0 RTE update to the >>>> underlying IO-APIC. However such approach relies on dom0 configuring >>>> all possible IO-APIC pins, even if no device on dom0 is using them, I >>>> think it's not a very reliable option. >>>> >>>> Another option would be to modify the toolstack to setup the GSI >>>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous >>>> email, since we only care about PCI device passthrough the legacy INTx >>>> should always be level triggered and low polarity. >>>> >>>>> I am not familiar with vioapic.c but to give you an idea of what I was >>>>> thinking: >>>>> >>>>> >>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c >>>>> index 4e40d3609a..16d56fe851 100644 >>>>> --- a/xen/arch/x86/hvm/vioapic.c >>>>> +++ b/xen/arch/x86/hvm/vioapic.c >>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig, >>>>> return ret; >>>>> } >>>>> >>>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq); >>>>> - if ( ret ) >>>>> - { >>>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n", >>>>> - gsi, ret); >>>>> - return ret; >>>>> - } >>>>> - >>>>> pcidevs_lock(); >>>>> ret = pt_irq_create_bind(currd, &pt_irq_bind); >>>>> if ( ret ) >>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent( >>>>> hvm_dpci_eoi(d, gsi); >>>>> } >>>>> >>>>> + if ( is_hardware_domain(d) ) >>>>> + { >>>>> + int pirq = gsi, ret; >>>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq); >>>>> + if ( ret ) >>>>> + { >>>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n", >>>>> + gsi, ret); >>>>> + return ret; >>>>> + } >>>>> + } >>>>> if ( is_hardware_domain(d) && unmasked ) >>>>> { >>>>> /* >>>> >>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE >>>> of likely each IO-APIC pin, which is IMO not quite reliable. In there >>>> are two different issues here that need to be fixed for PVH dom0: >>>> >>>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to >>>> succeed for a PVH dom0, even if dom0 is not using the GSI itself. >>> >>> Yes makes sense >>> >>> >>>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using >>>> the IO-APIC pin itself. >>>> >>>> First one needs to be fixed internally in Xen, second one will require >>>> the toolstack to issue an extra hypercall in order to ensure the >>>> IO-APIC pin is properly configured. >>> >>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts >>> correctly. Xen configures them all on its own at boot based on Device >>> Tree information. I guess it is not possible to do the same on x86? >> >> No, not exactly. There's some interrupt information in the ACPI MADT, >> but that's just for very specific sources (Interrupt Source Override >> Structures) >> >> Then on AML devices can have resource descriptors that contain >> information about how interrupts are setup. However Xen is not able >> to read any of this information on AML. >> >> Legacy PCI interrupts are (always?) level triggered and low polarity, >> because it's assumed that an interrupt source can be shared between >> multiple devices. > > Except that as per what you said just in the earlier paragraph ACPI can > tell us otherwise. > >> I'm however not able to find any reference to this in the PCI spec, >> hence I'm reluctant to take this for granted in Xen, and default all >> GSIs >= 16 to such mode. >> >> OTOH legacy PCI interrupts are not that used anymore, as almost all >> devices will support MSI(-X) (because PCIe mandates it) and OSes >> should prefer the latter. SR-IOV VF don't even support legacy PCI >> interrupts anymore. >> >>> If >>> not, then I can see why we would need 1 extra toolstack hypercall for >>> that (or to bundle the operation of configuring IO-APIC pins together >>> with an existing toolstack hypercall). >> >> One suitable compromise would be to default unconfigured GSIs >= 16 to >> level-triggered and low-polarity, as I would expect that to work in >> almost all cases. We can always introduce the usage of >> PHYSDEVOP_setup_gsi later if required. >> >> Maybe Jan has more input here, would you agree to defaulting non-ISA >> GSIs to level-triggered, low-polarity in the absence of a specific >> setup provided by dom0? > > Well, such defaulting is an option, but in case it's wrong we might > end up with hard to diagnose issues. Personally I'd prefer if we > didn't take shortcuts here, i.e. if we followed what Dom0 is able > to read from ACPI. When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable) I have a version of patch which tried that way, see below: diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c index ada3868c02c2..43e1bda9f946 100644 --- a/arch/x86/xen/enlighten_pvh.c +++ b/arch/x86/xen/enlighten_pvh.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include <linux/acpi.h> #include <linux/export.h> +#include <linux/pci.h> #include <xen/hvc-console.h> @@ -25,6 +26,127 @@ bool __ro_after_init xen_pvh; EXPORT_SYMBOL_GPL(xen_pvh); +typedef struct gsi_info { + int gsi; + int trigger; + int polarity; + int pirq; +} gsi_info_t; + +struct acpi_prt_entry { + struct acpi_pci_id id; + u8 pin; + acpi_handle link; + u32 index; /* GSI, or link _CRS index */ +}; + +static int xen_pvh_get_gsi_info(struct pci_dev *dev, + gsi_info_t *gsi_info) +{ + int gsi; + u8 pin = 0; + struct acpi_prt_entry *entry; + int trigger = ACPI_LEVEL_SENSITIVE; + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ? + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW; + + if (dev) + pin = dev->pin; + if (!pin) { + xen_raw_printk("No interrupt pin configured\n"); + return -EINVAL; + } + + entry = acpi_pci_irq_lookup(dev, pin); + if (entry) { + if (entry->link) + gsi = acpi_pci_link_allocate_irq(entry->link, + entry->index, + &trigger, &polarity, + NULL); + else + gsi = entry->index; + } else + return -EINVAL; + + gsi_info->gsi = gsi; + gsi_info->trigger = trigger; + gsi_info->polarity = polarity; + + return 0; +} + +static int xen_pvh_map_pirq(gsi_info_t *gsi_info) +{ + struct physdev_map_pirq map_irq; + int ret; + + map_irq.domid = DOMID_SELF; + map_irq.type = MAP_PIRQ_TYPE_GSI; + map_irq.index = gsi_info->gsi; + map_irq.pirq = gsi_info->gsi; + + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq); + gsi_info->pirq = map_irq.pirq; + + return ret; +} + +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info) +{ + struct physdev_unmap_pirq unmap_irq; + + unmap_irq.domid = DOMID_SELF; + unmap_irq.pirq = gsi_info->pirq; + + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq); +} + +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info) +{ + struct physdev_setup_gsi setup_gsi; + + setup_gsi.gsi = gsi_info->gsi; + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1); + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1); + + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi); +} + +int xen_pvh_passthrough_gsi(struct pci_dev *dev) +{ + int ret; + gsi_info_t gsi_info; + + if (!dev) { + return -EINVAL; + } + + ret = xen_pvh_get_gsi_info(dev, &gsi_info); + if (ret) { + xen_raw_printk("Fail to get gsi info!\n"); + return ret; + } + + ret = xen_pvh_map_pirq(&gsi_info); + if (ret) { + xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi); + return ret; + } + + ret = xen_pvh_setup_gsi(&gsi_info); + if (ret == -EEXIST) { + ret = 0; + xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi); + } else if (ret) { + xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi); + xen_pvh_unmap_pirq(&gsi_info); + } + + return ret; +} +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi); + void __init xen_pvh_init(struct boot_params *boot_params) { u32 msr; diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c index ff30ceca2203..630fe0a34bc6 100644 --- a/drivers/acpi/pci_irq.c +++ b/drivers/acpi/pci_irq.c @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev, } #endif /* CONFIG_X86_IO_APIC */ -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin) +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin) { struct acpi_prt_entry *entry = NULL; struct pci_dev *bridge; diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c index e34b623e4b41..1abd4dad6f40 100644 --- a/drivers/xen/xen-pciback/pci_stub.c +++ b/drivers/xen/xen-pciback/pci_stub.c @@ -20,6 +20,7 @@ #include <linux/atomic.h> #include <xen/events.h> #include <xen/pci.h> +#include <xen/acpi.h> #include <xen/xen.h> #include <asm/xen/hypervisor.h> #include <xen/interface/physdev.h> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev) if (err) goto config_release; + if (xen_initial_domain() && xen_pvh_domain()) { + err = xen_pvh_passthrough_gsi(dev); + if (err) + goto config_release; + } + if (dev->msix_cap) { struct physdev_pci_device ppdev = { .seg = pci_domain_nr(dev->bus), diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 641dc4843987..368d56ba2c5e 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi); struct pci_dev; +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin); int acpi_pci_irq_enable (struct pci_dev *dev); void acpi_penalize_isa_irq(int irq, int active); bool acpi_isa_irq_available(int irq); diff --git a/include/xen/acpi.h b/include/xen/acpi.h index b1e11863144d..ce7f5554f88e 100644 --- a/include/xen/acpi.h +++ b/include/xen/acpi.h @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void) acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel; } } +int xen_pvh_passthrough_gsi(struct pci_dev *dev); #else static inline void xen_acpi_sleep_register(void) { > > Jan -- Best regards, Jiqian Chen.