Re: [PATCH] PCI / ACPI: Always resume devices on ACPI wakeup notifications

Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx> · Tue, 02 Apr 2013 22:55:02 +0200

[ +linux-pci and Yinghai, as they have already suffered through the many emails on
 the individual threads, so one overview email hopefully won't hurt ] ;-)

Martin Mokrejs wrote:
> 
> 
> Bjorn Helgaas wrote:
>> On Tue, Apr 2, 2013 at 9:02 AM, Martin Mokrejs
>> <mmokrejs@xxxxxxxxxxxxxxxxxx> wrote:
>>> Hi Ying,
>>>
>>> huang ying wrote:
>>
>>>> And please give me the full dmesg for boot and incremental dmesg for
>>>> operations.
>>>
>>>
>>> The incremental bits are here; I will send the full dmesg directly to your email only, due to its size.
>>
>> Is there a bugzilla for this issue?  Please attach the complete dmesg
>> there or somewhere similar so we can all benefit.
> 
> I changed my mind. I am attaching the dmesg here but omitting linux-acpi
> list. After I hear a proposal from Rafael/Bjorn I will open separate bugs.
> I thought that the threads I started so far were enough but yes, dmesg
> files don't pass through list filters so I should move that to bugzilla.
> 
> So far my view of the bugs was:
> 
> 1) acpiphp hotplug broken due to upstream pcieport 1c.7 PME# enabled
>   (eSATA-based card)

Fixed by Ying Huang's port_dbg.patch applied over 3.8.5 (fixes acpiphp hotplug
of eSATA and Firewire cards, NOT the hotplug of a NEC-based USB3 card -> hence
the bug 4) below). Now I can continue using laptop-mode-tools.

> 2) xHCI dead due to its suspend - 3.8 series and above

Not fixed by port_dbg.patch applied over 3.8.5. Interestingly, a NEC-based
XHCI card *in an express card slot* does not suffer this suspend issue,
although it is put into suspend when a device is unplugged.

> 3) pciehp completely broken since about 3.6, still 3.9-rc5

Even with 3.9-rc5 plus patch 2368081 and port_dbg.patch from Ying Huang, this is
still broken (the eject of a cold plugged device from an express card slot).
That results in /proc/interrupts claiming IRQ19 is still used by the driver.
Non-forced but manual 'rmmod sata_sil24' removes the IRQ 19 from the listing.
The rmmod also removes association with sata_sil24 from the /proc/iomem but
the device 11:00 is retained in the file with its memory ranges.
lspci provides, as I have described many times, conflicting information.
Actually, I trust lspci more than the /proc/ files.

> 
> 
> 
> There is one more which actually brought me into all of this in May 2012 at about
> 3.2.x kernels:
> 
> 4) Even when the upstream port 1c.7 has its power control forced to 'on', hot removal of
>    a USB3 express card is broken: only every second eject is recognized.
>    This is likely related to xhci_hcd having a special privilege to handle IRQ/PM
>    in its own way. In contrast, Firewire and eSATA cards work under same
>    circumstances. I see different sleep states listed as supported by those
>    cards but my bet is that is due to the exceptional xhci_hcd privilege.
>    I briefly repeated that already with 3.9-rc5.

Still broken even with port_dbg.patch applied over 3.8.5. It turns out the unnoticed
ejects and inserts are actually detected, but later, with a delay of 30 sec or so.
Hmm, in my original thread back in 2012 I said a 60 sec delay, but it is likely
still the same problem:
3.2.11: PCI Express card cannot be re-detected withing cca 60sec timeframe

Before I forget, I will sketch several more bugs I hit, all of which are documented
in my postings from the last week or two. I can provide the URLs to those postings
already in the archives and maybe summarize them in bugzilla, after we agree on what
will be worked on and where (email ... bugzilla), under the best matching subject
you will propose.

5) lspci causes wake and suspend of pcieport-handled devices. I fear this is
not good. Maybe it does the same to other pci devices, but the "problem" is
that no other pci drivers report the same type of message. I would like to see
the PME# enabled/disabled messages generated by other drivers as well, ideally by
some upstream, common driver.

6) sata_sil24 sometimes initializes badly under pciehp. Once you fix pciehp,
I would still like to get the init of sata_sil24 fixed as well.
There are two wrong paths in the driver. One is:

[  899.894862] sata_sil24 0000:11:00.0: version 1.1
[  899.894880] sata_sil24 0000:11:00.0: enabling device (0000 -> 0003)
[  899.985994] sata_sil24 0000:11:00.0: failed to clear port RST
[  900.086097] sata_sil24 0000:11:00.0: failed to clear port RST
[  900.086119] sata_sil24 0000:11:00.0: enabling bus mastering

while the other is:

[  974.021661] pcieport 0000:00:1c.0: PME# disabled
[  974.041697] pcieport 0000:00:1c.7: PME# disabled
[ 1048.450168] sata_sil24 0000:11:00.0: version 1.1
[ 1048.463692] sata_sil24 0000:11:00.0: Refused to change power state, currently in D3
[ 1048.563818] sata_sil24 0000:11:00.0: failed to clear port RST
[ 1048.663935] sata_sil24 0000:11:00.0: failed to clear port RST

Both lead to a broken device and I would prefer the driver to fail to load.
It seems they are at least in part related to early device eject while the
driver did not yet turn down an unused external SATA port.

7) It seems Rafael or Bjorn have a clue why I sometimes see only PME# disabled
or only PME# enabled in dmesg for a particular device, and I am worried about when
it was silently switched to the other state. I would like to hear that this can be
prevented in the future by some cross-checks, by design.
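Until something like that exists in the kernel, the missing transitions can at
least be spotted from user space. What follows is only a rough sketch, assuming
the dmesg lines keep the "driver BDF: PME# enabled/disabled" shape shown above;
the function name is made up for illustration. It reports every PME# message
that repeats the state last logged for the same device, which means the
opposite switch happened without a message in between.

```python
import re

# Matches lines such as "[  974.021661] pcieport 0000:00:1c.0: PME# disabled".
# The layout is taken from the dmesg excerpts in this thread (an assumption
# about other kernels and drivers).
PME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"PME# (?P<state>enabled|disabled)"
)


def find_repeated_pme_states(lines):
    """Return (device, timestamp, state) for each PME# message that repeats
    the state previously logged for the same device."""
    last_state = {}   # device -> last logged PME# state
    repeats = []
    for line in lines:
        m = PME_RE.search(line)
        if m is None:
            continue
        dev, state = m.group("dev"), m.group("state")
        if last_state.get(dev) == state:
            # Same state twice in a row: the switch in between was not logged.
            repeats.append((dev, float(m.group("ts")), state))
        last_state[dev] = state
    return repeats
```

It can be fed the lines of a saved dmesg file, for example
find_repeated_pme_states(open("dmesg.txt")), where dmesg.txt is whatever file
the log was saved to.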

8) I don't know whether one can ensure that a driver releases either both
the IRQ and the memory ranges it has allocated, or nothing at all, or else that an
oops happens, whatever. Maybe something could track what the driver grabbed and make
sure both are released. Even a background scan of the /proc files would be fine.
The disagreement with lspci is not good.
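To make the idea of such a scan concrete, here is a minimal user-space sketch.
It assumes the usual text layout of /proc/interrupts ("IRQ: counts ... names")
and /proc/iomem ("range : owner"); the function names are invented for this
example, and it does not touch lspci at all. It only answers the question of
whether a driver still holds an IRQ while holding no memory range, or the
other way around.

```python
def irqs_for_driver(interrupts_text, driver):
    """Return the numeric IRQs whose /proc/interrupts line names `driver`."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header line or a named row such as NMI/LOC
        # Handlers sharing an IRQ are listed comma-separated at the line end.
        if driver in rest.replace(",", " ").split():
            irqs.append(int(head))
    return irqs


def mem_ranges_for_driver(iomem_text, driver):
    """Return the address ranges that /proc/iomem attributes to `driver`."""
    ranges = []
    for line in iomem_text.splitlines():
        addr, sep, owner = line.partition(" : ")
        if sep and owner.strip() == driver:
            ranges.append(addr.strip())
    return ranges


def release_is_consistent(interrupts_text, iomem_text, driver):
    """True if the driver holds both kinds of resource or neither of them."""
    has_irq = bool(irqs_for_driver(interrupts_text, driver))
    has_mem = bool(mem_ranges_for_driver(iomem_text, driver))
    return has_irq == has_mem
```

Reading the two files with open("/proc/interrupts").read() and
open("/proc/iomem").read() and passing "sata_sil24" as the driver would cover
the case described in 3) above.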

9) In the thread 
Re: 3.8.2: stale pci device info for a previously inserted express card
I already showed an example that chimeric entries in 'lspci -vvv' output
can appear. Some data describe the previously loaded card in an Express
Card Slot while the other the one currently loaded in the slot.
This might lead to an explanation of why there are lines in lspci like these:

a)
Latency: 0
Latency: 0, Cache Line Size: 64 bytes
or the Latency: line missing altogether

b)
[virtual] Expansion ROM at f6c00000 [disabled] [size=512K]
Expansion ROM at f6c00000 [size=512K]

c)
Region 0: Memory at f6c84000 (64-bit, non-prefetchable) [size=128]
Region 0: Memory at f6c84000 (64-bit, non-prefetchable) [disabled] [size=128]

If the kernel does not give a hint about what is wrong with a device/driver, then
maybe lspci could do a runtime check and give a more useful, user-oriented warning.
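For what it is worth, the comparisons behind a), b) and c) can be automated
outside of lspci. The sketch below is only an illustration: it assumes two
saved 'lspci -vvv' outputs in the usual layout (one stanza per device, stanzas
separated by a blank line), and the helper names are made up. It pulls out the
stanza of one slot from each capture and returns only the lines that differ.

```python
import difflib


def device_block(lspci_text, slot):
    """Return the stripped lines of the stanza starting with `slot`,
    e.g. "11:00.0", or an empty list if the slot is not listed."""
    block = []
    inside = False
    for line in lspci_text.splitlines():
        if not inside:
            if line.startswith(slot + " "):
                inside = True
                block.append(line.strip())
        elif line.strip() == "":
            break  # a blank line ends the stanza
        else:
            block.append(line.strip())
    return block


def changed_lines(before_text, after_text, slot):
    """Return the '- old' / '+ new' lines for `slot` between two captures."""
    diff = difflib.ndiff(device_block(before_text, slot),
                         device_block(after_text, slot))
    # Drop unchanged lines ("  ") and ndiff's "? " hint lines.
    return [d for d in diff if d.startswith(("- ", "+ "))]
```

Running changed_lines() on a capture taken before and one taken after a card
swap for slot "11:00.0" would list pairs like the Region 0 lines in c).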

>>
>> I think we have two problems that may be relevant to this discussion.
>>
>> 1) The _OSC "PCI Express Capability Structure control" bit.  I don't
>> think Linux pays attention to whether the BIOS has granted us control
>> over the capability, so we may do things to it that the BIOS doesn't
>> expect.
>>
>> 2) acpiphp currently uses the presence of _ADR/_EJ0/_RMV to detect
>> hotplug slots.  I don't think this is sufficient (see
>> https://bugzilla.kernel.org/show_bug.cgi?id=54981 for details).
>> Therefore, I don't think pci_bus_has_hotplug_slots() in port_dbg.patch
>> can be accurate.  I think it returns "false" for some buses where it
>> should return "true," such as the ExpressCard slot on Chris Clayton's
>> system (see bug 54981).
> 
> But I do not know whether and how to split the above 4 bugs into maybe more,
> better described bugs. I will repeat them likely with 3.8.5 and 3.9-rc5,
> I got quite skilled running diff all the last days and weeks. ;-)
> 
> I am waiting for some answers from you before opening bug reports.
> Please tell me how to name them and what data you want to get where.
> After I open them I will try to (re)attach your patches. Ying, do you have an
> update for the port_dbg.patch per Bjorn's comments about the pci_bus_has_hotplug_slots()
> being inaccurate? I would gladly wait for an updated patch catching rather
> more scenarios than less.

Feel free to comment on the listing of deemed bugs, add more you saw in the
logs or diffs yourself (especially those downstream, secondary bugs which will
be soon masked by the hotplug issues being *fixed*). ;)
I am quite optimistic. ;))

The above listings don't contain URLs, but that can all be sorted out in
the respective bugzilla entries.

Thank you,
Martin