On Mon, Jan 12, 2015 at 1:15 PM, Andreas Hartmann <andihartmann@xxxxxxxxxx> wrote:
> Hello Alex!
>
> Alex Williamson wrote:
>> On Mon, 2015-01-12 at 16:20 +0100, Andreas Hartmann wrote:
>>> Alex Williamson wrote:
>>>> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>>>>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>>>>>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
>>>>>> this Atheros device causes system lock-ups and resets.  I've also
>>>>>> been able to confirm this behavior on multiple systems.  The device
>>>>>> never returns from reset and attempts to access config space of the
>>>>>> device after reset result in hangs.  Blacklist bus reset for the
>>>>>> device to avoid this issue.
>>>>>>
>>>>>> Reported-by: Andreas Hartmann <andihartmann@xxxxxxxxxx>
>>>>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>>>>> Tested-by: Andreas Hartmann <andihartmann@xxxxxxxxxx>
>>>>>
>>>>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>>>>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>>>>
>>>>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>>>>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>>>>> report about that issue?
>>>>>
>>>>> I don't understand the connection between 425c1b223dac and
>>>>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>>>>> resets.  Is that the wrong commit, or can you outline the connection for
>>>>> me?
>>>>
>>>> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
>>>> I'm not sure how Andreas' bisect landed there.
>>>
>>> Because removing this patch made it work again :-)
>>>
>>> And too:
>>> http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984
>>>
>>> Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
>>> kernel, which hangs the machine at startup of the VM. The userland
>>> (qemu) didn't change in between.
>>
>> s/2\./3\./
>
> Thanks :-) It seems I don't like the number 3 :-)
>
>> Ok, so what about VC save/restore (425c1b223dac) is the problem then?
>> When we tried to determine that, you found that if we continue from the
>> top of the save loop, everything works (ie. no VC state saved), but if
>> you continue after the variable declaration of the same loop (ie. still
>> no VC state saved), it breaks:
>>
>> http://www.spinics.net/lists/linux-pci/msg36166.html
>>
>> So, please forgive me if I don't have a whole lot of faith that
>> 425c1b223dac is involved.
>
> It's hard for me, too. Really. It's kind of mystifying.
>
>> We also both independently determined that this particular device never
>> recovers from a PCI bus reset, even when done from userspace with setpci
>> and absolutely no save/restore wrappers.
>
> Yes.
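The command-line experiment being referred to is presumably along these lines
(a sketch only: the root port at 00:1c.0 and the Atheros device at 03:00.0 are
made-up example addresses, bit 6 of Bridge Control is the Secondary Bus Reset
bit, and everything has to run as root):

  # assumed layout: Atheros device at 03:00.0 behind root port 00:1c.0
  setpci -s 03:00.0 VENDOR_ID        # reads 168c (Atheros) before the reset

  # assert and then de-assert Secondary Bus Reset (bit 6 of Bridge Control)
  bctl=$(setpci -s 00:1c.0 BRIDGE_CONTROL)
  setpci -s 00:1c.0 BRIDGE_CONTROL=$(printf %04x $((0x$bctl | 0x40)))
  sleep 0.2
  setpci -s 00:1c.0 BRIDGE_CONTROL=$bctl
  sleep 1

  # a healthy device answers again here; per the reports above, the AR93xx's
  # config space never comes back, so this access hangs or fails
  setpci -s 03:00.0 VENDOR_ID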
>> Config space on the device is
>> never accessible after the reset.
>
> Yes.
>
>> Therefore, how could any sort of bus
>> reset with save/restore ever work for this device?
>
> I can't say. What I can definitely say is that I never had problems with
> running VMs w/ qemu until 3.14 came up. Do you think I'm lying? I used
> 3.10 and 3.12 for a long time w/o (known!) problems (3.12 only on the
> first start of the VM). Otherwise I would have been here a long time
> ago :-))).
>
>>> Therefore: from my point of view, it is a regression, because things
>>> have been working < 2.14.
>>>
>>> Besides that: It is undoubted that there is a problem with resetting
>>> this card. But the difference between >= 3.14 and < 3.14 is that < 3.14
>>> has been working nevertheless. The patch
>>> 425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something,
>>> though I can't say what. Therefore, the quirk patch is definitely
>>> required, because things work completely fine again w/ this patch.
>>>
>>> "Working" means for me here: I was able to start (and use) the VM w/o
>>> crashing the machine, and this isn't possible w/ unpatched 2.14+ any
>>> more. Yes, w/ 2.12 I wasn't able to restart the VM (it then crashed the
>>> machine), but w/ 2.10 even this was possible.
>>
>> What?!  So v3.12 still had a machine crash when assigning this device.
>
> Yes. If you *re*start the VM (for a long time I didn't know that fact at
> all - I just discovered it during testing while analyzing the problem :-)).
> The first start (after reboot) was not a problem. This was the usual use
> case here :-)).
>
> Believe me, I'm really convinced that this card does have a problem with
> resets. I'm just wondering why it worked for me until 3.13. That's all.
>
>> The vfio hot reset interface was added in v3.12, so v3.10 didn't have
>> any way to do a reset other than what pci_reset_function() decided to
>> do.  That all seems to associate the machine crash to the ability to do
>> a bus reset on the device.  I'm not sure why the behavior changed
>> between v3.14 and v3.12 (maybe the try-reset addition), but there's some
>> sort of pre-existing issue before we even got to 425c1b223dac.
>
> Most probably.
>
>> I'm perfectly happy tagging this for stable,
>
> Thanks!! I'm really very comfortable with your patch and your support!
> Really! Thanks a lot! It's just odd to me why it partly worked (the first
> start of the VM worked) w/ 3.12 and 3.13, and w/ 3.14 suddenly not at
> all any more.
>
> You just happened to be the one hit by this - most probably it could
> have been any other change, too. Sorry for that :-(. Therefore: kudos
> for fixing the problem anyway. This is really not something to take for
> granted!

So we should be able to add instrumentation to the reset paths in
425c1b223dac and 425c1b223dac^ and see some difference in how those paths
are exercised.  Right?

It still feels like there's some magic we don't understand here, and that
niggles at me.

Bjorn
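That instrumentation could be as simple as the following hypothetical
debugging change (not part of this series): add one dev_info() line at the
top of each reset method in drivers/pci/pci.c on both commits, start the VM
on each kernel, and diff the dmesg output to see which reset paths actually
get exercised for the device.

  static int pci_parent_bus_reset(struct pci_dev *dev, int probe)
  {
  	/* hypothetical debug line; the rest of the function is unchanged */
  	dev_info(&dev->dev, "%s: probe=%d\n", __func__, probe);
  	/* ... existing body ... */
  }

  /*
   * Likewise in pci_pm_reset(), pcie_flr(), pci_af_flr() and
   * pci_dev_specific_reset(), so every attempted reset method is logged.
   */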
>> but it seems like a
>> hardware bug exposed by allowing userspace the ability to select a bus
>> reset.  Whether or not that's a kernel regression isn't exactly clear to
>> me ("new functionality exposes broken hardware, news at 11").  Thanks,
>>
>> Alex
>
>
> Kind regards,
> Andreas
>
>>>> IME, this device cannot,
>>>> and has never been able to handle a bus reset.  A simple setpci
>>>> experiment on the commandline can confirm this.  What I think happened
>>>> is that with the PCI bus reset infrastructure we added, we switched QEMU
>>>> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
>>>> just more prolific use of bus resets by userspace.
>>>>
>>>> There's also no regression in 1 & 2, PM reset has never done anything
>>>> useful on those devices.  Thanks,
>>>>
>>>> Alex
>>>>
>>>>>> ---
>>>>>>
>>>>>>  drivers/pci/quirks.c | 14 ++++++++++++++
>>>>>>  1 file changed, 14 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>>>> index 561e10d..ebbd5b4 100644
>>>>>> --- a/drivers/pci/quirks.c
>>>>>> +++ b/drivers/pci/quirks.c
>>>>>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>>>>>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>>>>>>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>>>>>>
>>>>>> +static void quirk_no_bus_reset(struct pci_dev *dev)
>>>>>> +{
>>>>>> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>>>>>> + * throw a Link Down error on AER capable system and regardless of AER,
>>>>>> + * config space of the device is never accessible again and typically
>>>>>> + * causes the system to hang or reset when access is attempted.
>>>>>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
>>>>>> + */
>>>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>>>>>> +
>>>>>>  #ifdef CONFIG_ACPI
>>>>>>  /*
>>>>>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
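For context on how the flag set by this quirk presumably takes effect: the
companion patch in the series checks PCI_DEV_FLAGS_NO_BUS_RESET in the
secondary bus reset path, so a flagged device simply reports that a bus reset
is unavailable and callers fall back to other reset methods.  Roughly like
this (a simplified sketch of the idea, not the literal patch):

  static int pci_parent_bus_reset(struct pci_dev *dev, int probe)
  {
  	struct pci_dev *pdev;

  	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
  	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
  		return -ENOTTY;	/* bus reset not usable for this device */

  	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
  		if (pdev != dev)
  			return -ENOTTY;	/* device doesn't own the bus alone */

  	if (probe)
  		return 0;

  	pci_reset_bridge_secondary_bus(dev->bus->self);

  	return 0;
  }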