Re: [PATCH 4/4] PCI: quirk Atheros AR93xx to avoid bus reset

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Alex!

Alex Williamson wrote:
> On Mon, 2015-01-12 at 16:20 +0100, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>>>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>>>>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
>>>>> this Atheros device cause system lock-ups and resets.  I've also
>>>>> been able to confirm this behavior on multiple systems.  The device
>>>>> never returns from reset and attempts to access config space of the
>>>>> device after reset result in hangs.  Blacklist bus reset for the
>>>>> device to avoid this issue.
>>>>>
>>>>> Reported-by: Andreas Hartmann <andihartmann@xxxxxxxxxx>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>>>> Tested-by: Andreas Hartmann <andihartmann@xxxxxxxxxx>
>>>>
>>>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>>>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>>>
>>>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>>>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>>>> report about that issue?
>>>>
>>>> I don't understand the connection between 425c1b223dac and
>>>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>>>> resets.  Is that the wrong commit, or can you outline the connection for
>>>> me?
>>>
>>> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
>>> I'm not sure how Andreas' bisect landed there. 
>>
>> Because removing this patch made it working again :-)
>>
>> And too:
>> http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984
>>
>> Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
>> kernel, which hangs the machine at startup of the VM. The userland
>> (qemu) didn't change in between.
> 
> s/2\./3\./

Thanks :-) It seems I don't like the number 3 :-)

> Ok, so what about VC save/restore (425c1b223dac) is the problem then?
> When we tried to determine that, you found that if we continue from the
> top of the save loop, everything works (ie. no VC state saved), but if
> you continue after the variable declaration of the same loop (ie. still
> no VC state saved), it breaks:
> 
> http://www.spinics.net/lists/linux-pci/msg36166.html
> 
> So, please forgive me if I don't have a whole lot of faith that
> 425c1b223dac is involved.

It's hard for me, too. Really. It's kind of mystique.

> We also both independently determined that this particular device never
> recovers from a PCI bus reset, even when done from userspace with setpci
> and absolutely no save/restore wrappers.

Yes.

>  Config space on the device is
> never accessible after the reset.

Yes.

>  Therefore, how could any sort of bus
> reset with save/restore ever work for this device?

I can't say. What I definitely can say, is that I never had problems
with running VMs w/ qemu until 3.14 came up. Do you think I'm lying? I
used 3.10. and 3.12. for long time w/o (known!) problems (3.12 only on
first start of VM). Otherwise I would have been here long time before :-))).

>> Therefore: from my point of view, it is a regression, because things
>> have been working < 2.14.
>>
>> Besides that: It is undoubted, that there is a problem with resetting
>> this card. But the difference between >= 3.14 and < 3.14 is, that < 3.14
>> has been working nevertheless. The patch
>> 425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something
>> which I can't say and I don't know off. Therefore, the quirk-patch is
>> definitely required, because things work completely fine again w/ this
>> patch.
>>
>> "Working" means for me here: I was able to start (and use) the VM w/o
>> crashing the machine and this isn't possible w/ unpatched 2.14+ any
>> more. Yes, w/ 2.12, I wasn't able to restart the VM (it then crashed the
>> machine), but w/ 2.10 even this was possible.
> 
> What?!  So v3.12 still had a machine crash when assigning this device.

Yes. If you *re*start the VM (long time, I didn't knew that fact at all
- I just discovered it during testing while analyzing the problem :-)).
The first start (after reboot) was not a problem. This was the usual use
case here :-)).

Believe me, I'm really convinced that this card does have a problem with
resets. I'm just wondering why it had worked for me until 3.13. That's all.

> The vfio hot reset interface was added in v3.12, so v3.10 didn't have
> any way to do a reset other than what pci_reset_function() decided to
> do.  That all seems to associate the machine crash to the ability to do
> a bus reset on the device.  I'm not sure why the behavior changed
> between v3.14 and v3.12 (maybe the try-reset addition), but there's some
> sort of pre-existing issue before we even got to 425c1b223dac.

Most probably.

> I'm perfectly happy tagging this for stable,

Thanks!! I'm really very comfortable with your patch and your support!
Really! Thanks a lot! It's just odd for me, why it partly worked (first
start of VM worked) w/ 3.12 and 3.13 and 3.14 suddenly no more at all.

You have been accidentally the sufferer - most probably it could have
hit any other change, too. Sorry for that :-(. Therefore: kudos for
anyway fixing the problem. This is really not a matter of course at all!

> but it seems like a
> hardware bug exposed by allowing userspace the ability to select a bus
> reset.  Whether or not that's a kernel regression isn't exactly clear to
> me ("new functionality exposes broken hardware, news at 11").  Thanks,
> 
> Alex


Kind regards,
Andreas

>>> IME, this device cannot,
>>> and has never been able to handle a bus reset.  A simple setpci
>>> experiment on the commandline can confirm this.  What I think happened
>>> is that with the PCI bus reset infrastructure we added, we switched QEMU
>>> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
>>> just more prolific use of bus resets by userspace.
>>>
>>> There's also no regression in 1 & 2, PM reset has never done anything
>>> useful on those devices.  Thanks,
>>>
>>> Alex
>>>
>>>>> ---
>>>>>
>>>>>  drivers/pci/quirks.c |   14 ++++++++++++++
>>>>>  1 file changed, 14 insertions(+)
>>>>>
>>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>>> index 561e10d..ebbd5b4 100644
>>>>> --- a/drivers/pci/quirks.c
>>>>> +++ b/drivers/pci/quirks.c
>>>>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>>>>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>>>>>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>>>>>  
>>>>> +static void quirk_no_bus_reset(struct pci_dev *dev)
>>>>> +{
>>>>> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>>>>> + * throw a Link Down error on AER capable system and regardless of AER,
>>>>> + * config space of the device is never accessible again and typically
>>>>> + * causes the system to hang or reset when access is attempted.
>>>>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
>>>>> + */
>>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>>>>> +
>>>>>  #ifdef CONFIG_ACPI
>>>>>  /*
>>>>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
>>>>>
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux