RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic

Dexuan Cui <decui@xxxxxxxxxxxxx> · Thu, 15 Jun 2023 04:27:45 +0000

> From: Lorenzo Pieralisi <lpieralisi@xxxxxxxxxx>
> Sent: Thursday, May 25, 2023 3:16 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
> >  *data)
> >  	pbus = pdev->bus;
> >  	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> >  	int_desc = data->chip_data;
> > +	if (!int_desc) {
> > +		dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> > +			 __func__, data->irq);
> > +		return;
> > +	}
> 
> That's a check that should be there regardless ?
Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).

In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.

> >  	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct
> irq_data *data, struct msi_msg *msg)
> >  		hv_pci_onchannelcallback(hbus);
> >  		spin_unlock_irqrestore(&channel->sched_lock, flags);
> >
> > -		if (hpdev->state == hv_pcichild_ejecting) {
> > -			dev_err_once(&hbus->hdev->device,
> > -				     "the device is being ejected\n");
> > -			goto enable_tasklet;
> > -		}
> > -
> >  		udelay(100);
> >  	}
> 
> I don't understand why this code is in hv_compose_msi_msg() in the first
> place (and why only in that function ?) to me this looks like you are
> adding plasters in the code that can turn out to be problematic while
> ejecting a device, this does not seem robust at all - that's my opinion.

The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

de0aa7b2f97d says

"
2. If the host is ejecting the VF device before we reach
    hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
    forever, because at this time the host doesn't respond to the
    CREATE_INTERRUPT request.
"

de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.

So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)

The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

> Feel free to merge this code, I can't ACK it, sorry.
> 
> Lorenzo
Thanks for sharing the thougths!