Patch "PCI: hv: Fix hibernation in case interrupts are not re-created" has been added to the 5.8-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Mon, 26 Oct 2020 01:31:29 -0400

This is a note to let you know that I've just added the patch titled

    PCI: hv: Fix hibernation in case interrupts are not re-created

to the 5.8-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     pci-hv-fix-hibernation-in-case-interrupts-are-not-re.patch
and it can be found in the queue-5.8 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 019d074ad3352110d2c55f4c73a9c402ac067f6b
Author: Dexuan Cui <decui@xxxxxxxxxxxxx>
Date:   Fri Oct 2 01:51:58 2020 -0700

    PCI: hv: Fix hibernation in case interrupts are not re-created
    
    [ Upstream commit 915cff7f38c5e4d47f187f8049245afc2cb3e503 ]
    
    pci_restore_msi_state() directly writes the MSI/MSI-X related registers
    via MMIO. On a physical machine, this works perfectly; for a Linux VM
    running on a hypervisor, which typically enables IOMMU interrupt remapping,
    the hypervisor usually should trap and emulate the MMIO accesses in order
    to re-create the necessary interrupt remapping table entries in the IOMMU,
    otherwise the interrupts can not work in the VM after hibernation.
    
    Hyper-V is different from other hypervisors in that it does not trap and
    emulate the MMIO accesses, and instead it uses a para-virtualized method,
    which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
    of the info that would be passed to the hypervisor in the case of the
    trap-and-emulate method. This is not an issue to a lot of PCI device
    drivers, which destroy and re-create the interrupts across hibernation, so
    hv_compose_msi_msg() is called automatically. However, some PCI device
    drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
    proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
    across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
    otherwise the PCI device drivers can no longer receive interrupts after
    the VM resumes from hibernation.
    
    Hyper-V is also different in that chip->irq_unmask() may fail in a
    Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
    can not fail because unmasking an MSI/MSI-X register just means an MMIO
    write): during hibernation, when a CPU is offlined, the kernel tries
    to move the interrupt to the remaining CPUs that haven't been offlined
    yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
    because the vmbus channel has been closed: here the early "return" in
    hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
    desc->masked remains "true", so later after hibernation, the MSI interrupt
    always remains masked, which is incorrect. Refer to cpu_disable_common()
    -> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():
    
    static bool migrate_one_irq(struct irq_desc *desc)
    {
    ...
            if (maskchip && chip->irq_mask)
                    chip->irq_mask(d);
    ...
            err = irq_do_set_affinity(d, affinity, false);
    ...
            if (maskchip && chip->irq_unmask)
                    chip->irq_unmask(d);
    
    Fix the issue by calling pci_msi_unmask_irq() unconditionally in
    hv_irq_unmask(). Also suppress the error message for hibernation because
    the hypercall failure during hibernation does not matter (at this time
    all the devices have been frozen). Note: the correct affinity info is
    still updated into the irqdata data structure in migrate_one_irq() ->
    irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
    resumes, hv_pci_restore_msi_state() is able to correctly restore
    the interrupt with the correct affinity.
    
    Link: https://lore.kernel.org/r/20201002085158.9168-1-decui@xxxxxxxxxxxxx
    Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
    Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
    Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
    Reviewed-by: Jake Oshins <jakeo@xxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index bf40ff09c99d6..95c04b0ffeb16 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1275,11 +1275,25 @@ static void hv_irq_unmask(struct irq_data *data)
 exit_unlock:
 	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
 
-	if (res) {
+	/*
+	 * During hibernation, when a CPU is offlined, the kernel tries
+	 * to move the interrupt to the remaining CPUs that haven't
+	 * been offlined yet. In this case, the below hv_do_hypercall()
+	 * always fails since the vmbus channel has been closed:
+	 * refer to cpu_disable_common() -> fixup_irqs() ->
+	 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+	 *
+	 * Suppress the error message for hibernation because the failure
+	 * during hibernation does not matter (at this time all the devices
+	 * have been frozen). Note: the correct affinity info is still updated
+	 * into the irqdata data structure in migrate_one_irq() ->
+	 * irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
+	 * resumes, hv_pci_restore_msi_state() is able to correctly restore
+	 * the interrupt with the correct affinity.
+	 */
+	if (res && hbus->state != hv_pcibus_removing)
 		dev_err(&hbus->hdev->device,
 			"%s() failed: %#llx", __func__, res);
-		return;
-	}
 
 	pci_msi_unmask_irq(data);
 }
@@ -3368,6 +3382,34 @@ static int hv_pci_suspend(struct hv_device *hdev)
 	return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+	struct msi_desc *entry;
+	struct irq_data *irq_data;
+
+	for_each_pci_msi_entry(entry, pdev) {
+		irq_data = irq_get_irq_data(entry->irq);
+		if (WARN_ON_ONCE(!irq_data))
+			return -EINVAL;
+
+		hv_compose_msi_msg(irq_data, &entry->msg);
+	}
+
+	return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
+ * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
+ * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
+ * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
+ * Table entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3401,6 +3443,8 @@ static int hv_pci_resume(struct hv_device *hdev)
 
 	prepopulate_bars(hbus);
 
+	hv_pci_restore_msi_state(hbus);
+
 	hbus->state = hv_pcibus_installed;
 	return 0;
 out: