On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
> Thanks for the report, Robin.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the problem
> to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in
> v3.3. For starters, can you verify that, e.g., by building
> 69166fbf02c7 (the parent of 3c076351c402) to make sure that it works,
> and building 3c076351c402 itself to make sure it fails?
>
> Assuming that's the case, please attach the complete dmesg and "lspci
> -vvxxx" output for both kernels to the bugzilla. ASPM is a feature
> that is configured on both ends of a PCIe link, so I want to see the
> lspci info for the whole system, not just the SAS adapters.
>
> It's not practical to revert 3c076351c402 now, so I'd also like to see
> the same information for the newest possible kernel (if this is
> possible; I'm not clear on whether you can boot your system or not) so
> we can figure out what needs to be changed.

TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it fails
to start. Commit 3c076351c402 did make it worse, but I think we're right
that the bug lies in the SAS code.

Ok, I have done more testing on it (40+ boots), and I think we can show
that the problem is somewhere in how the BIOS/EFI/ROM brings up the card
in FastBoot mode, or in the state it leaves the card in.

A full boot of the system was difficult on the 3.2 kernels; they didn't
make it to userspace because other things on the system are too new for
them. For testing, I compiled CONFIG_MEGARAID_SAS=y on 3.2, and =m on
3.16-rc4; that way, when the initramfs & userspace failed, the megaraid
load was still captured over IPMI serial. I did a lot of the analysis
below while capturing.

I was going to be booting many times, so I flipped the 'Fast Boot' option
back to Disabled, so that I could more easily get to the BIOS settings to
change options while testing.
When I did so, an accidental boot on a kernel that had previously failed
suddenly worked, which made me raise an eyebrow and expand my test matrix.

3 kernels, 6 different BIOS config combinations (2x3) = 18 test cases.
Each configuration was booted at least twice; if the results of the two
boots were not identical, I booted a third time and took the majority
result. No kernel had any boot params involving PCI specified (none of
pci=, pcie*=, disable_msi*).

Kernels:
K.1: Ubuntu's 3.16-rc4
K.2: 3.2-rc4 3c076351c402 - aspm merged
K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8

BIOS: Boot -> FastBoot:
B1.1 Off
B1.2 On (CMOS reset default)

BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
B2.1 Force L0s
B2.2 BIOS (CMOS reset default)
B2.3 Disabled

Reduced Karnaugh map of results:
Kernels, B1,   B2:   Result
*,       B1.1, *     PASS
*,       B1.2, B2.1  VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
K.1,     B1.2, B2.2  FAIL
K.1,     B1.2, B2.3  FAIL
K.2,     B1.2, B2.2  FAIL
K.2,     B1.2, B2.3  FAIL
K.3,     B1.2, B2.2  PASS
K.3,     B1.2, B2.3  PASS

Here's the DMI info:
Motherboard: X9DRH-7TF/7F/iTF/iF
Version: 3.0b
Release Date: 04/28/2014

Recall also that I said I had two LSI cards in here: SAS2008 (in a slot)
and SAS2208 (onboard). Regardless of the BIOS settings, the SAS2008 card
continues to work, even when its I/O region 0 is marked as disabled. So
is there some other initialization work needed on the SAS2208 card so
that it works in all cases?

The case of FastBoot=on, ASPM=ForceL0s is the interesting one, and the
lspci outputs compare nicely. The only trimming to the diff below is to
remove the context of other devices (no changes). This also looks
functionally identical between 3c076351c402 and 69166fbf02c7.

Full lspci & dmesg for the working+broken 3.16-rc4 boots are attached.
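As a cross-check on the diff below, the Control and MSI-X differences on
the SAS2208 can be read straight out of the raw config-space bytes. The
following is a minimal sketch in Python, not a tool used in the testing
above. The helper names (le16, decode_command, decode_msix_control) are
hypothetical, and it assumes the standard PCI layout: command register at
offset 0x04, MSI-X Message Control at offset +2 within the capability at
0xc0. The hex rows are copied from the diff.

```python
# Bit positions in the PCI command register (config offset 0x04).
COMMAND_BITS = {"I/O": 0, "Mem": 1, "BusMaster": 2, "DisINTx": 10}


def le16(dump_row, offset):
    """Read a little-endian 16-bit value from one row of an `lspci -x` dump.

    `offset` is the byte offset within the 16-byte row.
    """
    data = bytes.fromhex(dump_row.split(":", 1)[1])
    return int.from_bytes(data[offset:offset + 2], "little")


def decode_command(value):
    """Map the command register onto the flags lspci prints as Control."""
    return {name: bool((value >> bit) & 1) for name, bit in COMMAND_BITS.items()}


def decode_msix_control(value):
    """Decode MSI-X Message Control: bit 15 = enable, bits 10:0 = size - 1."""
    return {"enable": bool((value >> 15) & 1), "table_size": (value & 0x7FF) + 1}


# Rows 00: and c0: of 01:00.0 (SAS2208), working vs. broken boot.
working_00 = "00: 00 10 5b 00 07 04 10 00 05 00 04 01 10 00 00 00"
broken_00 = "00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00"
working_c0 = "c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00"
broken_c0 = "c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00"

for label, row_00, row_c0 in (
    ("working", working_00, working_c0),
    ("broken", broken_00, broken_c0),
):
    command = le16(row_00, 4)   # command register sits at offset 0x04
    msix = le16(row_c0, 2)      # Message Control sits at 0xc0 + 2
    print(label, hex(command), decode_command(command), decode_msix_control(msix))
```

The working boot decodes to command 0x0407 (I/O, Mem, BusMaster and
DisINTx all set) with MSI-X enabled; the broken boot decodes to 0x0002
(Mem only) with MSI-X disabled. That matches the Control and MSI-X lines
lspci prints, and is consistent with the driver never having enabled the
device on the broken boot.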

-lspci.1405201451.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, working
+lspci.1405201693.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, broken

# diff -Nar lspci.1405201451.ASPM=L0s.FastBoot.no.kparams lspci.1405201693.ASPM=L0s.FastBoot.no.kparams -I '^[0-9a-f][0-9a-f]:' -F rev -U15
--- lspci.1405201451.ASPM=L0s.FastBoot.no.kparams 2014-07-12 21:44:11.243897367 +0000
+++ lspci.1405201693.ASPM=L0s.FastBoot.no.kparams 2014-07-12 21:48:13.866860888 +0000
@@ -1157,95 +1157,93 @@ 00:1f.6 Signal processing controller [11
(trim other device, no changes)
 01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
     Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB [15d9:0690]
-    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
-    Latency: 0, Cache Line Size: 64 bytes
     Interrupt: pin A routed to IRQ 16
-    Region 0: I/O ports at 8000 [size=256]
+    Region 0: I/O ports at 8000 [disabled] [size=256]
     Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
     Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
     Expansion ROM at dfe40000 [disabled] [size=128K]
     Capabilities: [50] Power Management version 3
         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
     Capabilities: [68] Express (v2) Endpoint, MSI 00
         DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
             ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
         DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
             MaxPayload 256 bytes, MaxReadReq 512 bytes
         DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
         LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
             ClockPM- Surprise- LLActRep- BwNot-
         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
         LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
         DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
         LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
              Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
              Compliance De-emphasis: -6dB
         LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
              EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
     Capabilities: [d0] Vital Product Data
-        Unknown small resource type 00, will not decode more.
+        Not readable
     Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
         Address: 0000000000000000 Data: 0000
-    Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
+    Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
         Vector table: BAR=1 offset=00002000
         PBA: BAR=1 offset=00003000
     Capabilities: [100 v2] Advanced Error Reporting
         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
         UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
         CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
         CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
         AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
     Capabilities: [1e0 v1] #19
     Capabilities: [1c0 v1] Power Budgeting <?>
     Capabilities: [190 v1] #16
     Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
         ARICap: MFVC- ACS-, Next Function: 0
         ARICtl: MFVC- ACS-, Function Group: 0
-    Kernel driver in use: megaraid_sas
-00: 00 10 5b 00 07 04 10 00 05 00 04 01 10 00 00 00
+00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
 10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
 20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
 30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
 60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
 70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
 80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
 90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
 a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00
-d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
+c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
+d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
(trim other device, no changes)
@@ -3049,35 +3047,35 @@ 80:05.4 PIC [0800]: Intel Corporation Xe
(trim other device, no changes)
 82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
     Subsystem: Dell 6Gbps SAS HBA Adapter [1028:1f1c]
-    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
     Latency: 0, Cache Line Size: 64 bytes
     Interrupt: pin A routed to IRQ 56
-    Region 0: I/O ports at f000 [size=256]
+    Region 0: I/O ports at f000 [disabled] [size=256]
     Region 1: Memory at fbe40000 (64-bit, non-prefetchable) [size=64K]
     Region 3: Memory at fbe00000 (64-bit, non-prefetchable) [size=256K]
     Expansion ROM at fbd00000 [disabled] [size=1M]
     Capabilities: [50] Power Management version 3
         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
     Capabilities: [68] Express (v2) Endpoint, MSI 00
         DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
             ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
         DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
             MaxPayload 256 bytes, MaxReadReq 512 bytes
         DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
         LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
             ClockPM- Surprise- LLActRep- BwNot-
         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
         LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
         DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
         LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
              Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
              Compliance De-emphasis: -6dB
         LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
     Capabilities: [d0] Vital Product Data
         Unknown small resource type 00, will not decode more.
     Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
         Address: 0000000000000000 Data: 0000
     Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
         Vector table: BAR=1 offset=0000e000
         PBA: BAR=1 offset=0000f800
     Capabilities: [100 v1] Advanced Error Reporting
         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
         UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
         CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
         CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
         AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
     Capabilities: [138 v1] Power Budgeting <?>
     Kernel driver in use: mpt2sas
-00: 00 10 72 00 07 04 10 00 03 00 07 01 10 00 00 00
+00: 00 10 72 00 06 04 10 00 03 00 07 01 10 00 00 00
 10: 01 f0 00 00 04 00 e4 fb 00 00 00 00 04 00 e0 fb
 20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 1c 1f
 30: 00 00 d0 fb 50 00 00 00 00 00 00 00 0b 01 00 00
 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
 60: 00 00 00 00 00 82 00 00 10 d0 02 00 25 80 00 10
 70: 2f 28 09 00 82 04 00 00 40 00 82 10 00 00 00 00
 80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
 90: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
 a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 c0: 11 00 0e 80 01 e0 00 00 01 f8 00 00 00 00 00 00
 d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
(trim other device, no changes)

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

Attachment:
dmesg.1405201451.ASPM=L0s.FastBoot.no.kparams.WORKING.gz
Description: Binary data
Attachment:
dmesg.1405201693.ASPM=L0s.FastBoot.no.kparams.BROKEN.gz
Description: Binary data
Attachment:
lspci.1405201451.ASPM=L0s.FastBoot.no.kparams.WORKING.gz
Description: Binary data
Attachment:
lspci.1405201693.ASPM=L0s.FastBoot.no.kparams.BROKEN.gz
Description: Binary data