On Monday, February 06, 2017 12:37:06 PM Mika Westerberg wrote:
> On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> > > sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> > > [ 375.376609] pci_hotplug: power_write_file: power = 1
> > > [ 375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status: SLOTCTRL a8 value read 17f1
> > > [ 375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [ 375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
> > > [ 375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
> > > [ 375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [ 375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [ 377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link Active not set in 1000 msec
> > > [ 378.960364] pci 0000:b4:00.0 id reading try 50 times with interval 20 ms to get ffffffff
> > > [ 378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status: lnk_status = 5001
> > > [ 378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> > > [ 378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> > > [ 378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [ 378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
> > > [ 380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off: SLOTCTRL a8 write cmd 300
> > > [ 380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [ 380.018020] pciehp 0000:b3:00.0:pcie004: pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> > > [ 380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
>
> It would be good to see the output when 68db9bc is reverted. Yinghai,
> can you attach that to the bugzilla bug as well?
>
> > So on this Skylake machine link training fails after resuming from D3hot
> > to D0.
> >
> > One thing that's a bit fishy is that normally the Link Disable bit is
> > cleared when powering on the slot. This results in a debug message
> > in dmesg containing the string "lnk_ctrl = ", and that line is missing
> > from the output you've pasted above, suggesting that the machine is
> > not running a stock v4.10 kernel after all but something else. Could
> > you check why this message is not printed? Could you check with lspci
> > if the Link Disable bit is set before you invoke "echo 1"?
> >
> > This is the call stack:
> > pciehp_sysfs_enable_slot()
> >   pciehp_enable_slot()
> >     board_added()
> >       pciehp_power_on_slot()
> >         pciehp_link_enable()
> >           __pciehp_link_set()
> >
> > Another theory is that the link is generally unreliable on this machine
> > since the Link Bandwidth Management Status bit is set in the Link Status
> > Register ("lnk_status = 5001"), which according to the spec means:
> >
> > "Hardware has changed Link speed or width to attempt to correct unreliable
> > Link operation, either through an LTSSM timeout or a higher level process.
> > This bit must be set if the Physical Layer reports a speed or width change
> > was initiated by the Downstream component that was not indicated as an
> > autonomous change."
> >
> > In this case it would be good to know which hardware exactly we're dealing
> > with so that we might quirk it to not runtime suspend the port. To that
> > end, could you attach a full dmesg log to the bugzilla entry I've created?
> > https://bugzilla.kernel.org/show_bug.cgi?id=193951
> >
> > @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> > training, or perhaps errata of Skylake chips related to link training
> > on hotplug ports?
>
> According to the 100-series (the chipset used with Skylake) errata
> below, I don't see any mentions related to PCIe link training issues.
>
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/100-series-chipset-spec-update.pdf

Still, it does look like errata to me. At least I don't see what can be
done on the software side to prevent this from happening except for leaving
the port(s) in question in D0.

Thanks,
Rafael
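
For reference, the "lnk_status = 5001" value from the log can be decoded by hand. The following is only a rough sketch, not kernel code: the masks are copied from the PCI_EXP_LNKSTA_* definitions in the kernel's pci_regs.h, while the function name decode_lnksta() and the dictionary keys are made up for illustration.

```python
# Bit masks for the PCIe Link Status register, mirroring the
# PCI_EXP_LNKSTA_* constants in include/uapi/linux/pci_regs.h.
LNKSTA_CLS = 0x000F    # Current Link Speed (encoded, 1 = 2.5 GT/s)
LNKSTA_NLW = 0x03F0    # Negotiated Link Width
LNKSTA_LT = 0x0800     # Link Training in progress
LNKSTA_SLC = 0x1000    # Slot Clock Configuration
LNKSTA_DLLLA = 0x2000  # Data Link Layer Link Active
LNKSTA_LBMS = 0x4000   # Link Bandwidth Management Status
LNKSTA_LABS = 0x8000   # Link Autonomous Bandwidth Status


def decode_lnksta(value):
    """Split a 16-bit Link Status value into its fields (hypothetical helper)."""
    return {
        "speed_code": value & LNKSTA_CLS,
        "width": (value & LNKSTA_NLW) >> 4,
        "training": bool(value & LNKSTA_LT),
        "slot_clock": bool(value & LNKSTA_SLC),
        "dllla": bool(value & LNKSTA_DLLLA),
        "lbms": bool(value & LNKSTA_LBMS),
        "labs": bool(value & LNKSTA_LABS),
    }


if __name__ == "__main__":
    # Value reported by pciehp_check_link_status in the log above.
    for field, val in decode_lnksta(0x5001).items():
        print(f"{field:>10}: {val}")
```

For 0x5001 this reports LBMS and Slot Clock Configuration set, Data Link Layer Link Active clear, and a negotiated width of 0, which matches the "Data Link Layer Link Active not set" and "link training error" messages in the log.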