Re: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
> Thanks for the report, Robin.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the problem
> to 3c076351c402 ("PCI: Rework ASPM disable code"), which appeared in
> v3.3.  For starters, can you verify that, e.g., by building
> 69166fbf02c7 (the parent of 3c076351c402) to make sure that it works,
> and building 3c076351c402 itself to make sure it fails?
> 
> Assuming that's the case, please attach the complete dmesg and "lspci
> -vvxxx" output for both kernels to the bugzilla.  ASPM is a feature
> that is configured on both ends of a PCIe link, so I want to see the
> lspci info for the whole system, not just the SAS adapters.
> 
> It's not practical to revert 3c076351c402 now, so I'd also like to see
> the same information for the newest possible kernel (if this is
> possible; I'm not clear on whether you can boot your system or not) so
> we can figure out what needs to be changed.
TL;DR: FastBoot is leaving the MegaRAID SAS controller in a weird state, and it
fails to start. Commit 3c076351c402 did make it worse, but I think we're right
that the bug lies in the SAS code.

OK, I have done more testing on it (40+ boots), and I think we can show the
problem is somewhere in how the BIOS/EFI/ROM brings up the card in FastBoot
mode, or how it leaves the card.

A full boot of the system was difficult on the 3.2 kernels: they didn't make it
to userspace because the rest of the system is too new for them. For testing, I
compiled CONFIG_MEGARAID_SAS=y on 3.2, and =m on 3.16-rc4; that way, when the
initramfs & userspace failed, the megaraid load was still captured over IPMI
serial.

I've done a lot of the analysis below while capturing.

I was going to be booting many times, so I flipped the 'Fast Boot'
option back to Disabled, so I could more easily get to the BIOS settings
to change options while testing. When I did so, an accidental boot on a
kernel that previously failed suddenly worked, leading me to raise an
eyebrow, and this expanded my test matrix more.

3 kernels, 6 different BIOS config combinations (2x3) = 18 test cases
Each configuration was booted at least twice; if the result of two boots was
not identical, I booted a third time and took the majority result.
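
The boot-count rule above can be written down as a small Python sketch. This is
only an illustration of the rule as described; the function name `settle` and
the "PASS"/"FAIL" strings are my own labels, not anything from the test
scripts.

```python
from collections import Counter


def settle(outcomes):
    """Return the agreed result for one configuration.

    `outcomes` is the list of per-boot results ("PASS" or "FAIL").
    Returns None when another boot is still needed: fewer than two
    boots so far, or no strict majority yet.
    """
    if len(outcomes) < 2:
        return None
    ranked = Counter(outcomes).most_common()
    if len(ranked) == 1:
        # Every boot so far agreed.
        return ranked[0][0]
    if ranked[0][1] > ranked[1][1]:
        # Strict majority, e.g. two out of three boots.
        return ranked[0][0]
    return None
```

With two disagreeing boots this returns None (boot again); a third boot then
decides the result.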

All kernels had no boot params involving PCI specified (none of pci=, pcie*=,
disable_msi*).

Kernels:
K.1: Ubuntu's 3.16-rc4
K.2: 3.2-rc4 3c076351c402 - aspm merged
K.3: 3.2-rc4 69166fbf02c7 - aspm merge parent
Notes: 3.2* compiled with GCC4.6, 3.16-rc4 with GCC4.8

BIOS: Boot -> FastBoot:
B1.1 Off
B1.2 On (CMOS reset default)

BIOS: Advanced -> PCIe/PCI/PnP Configuration -> ASPM Support
B2.1 Force L0s
B2.2 BIOS (CMOS reset default)
B2.3 Disabled

Reduced Karnaugh map of results:
Kernels,B1,B2:   Result
  *, B1.1,    *  PASS
  *, B1.2, B2.1  VARIABLE (9 runs: 5 fail, 4 pass, no kernel consistency)
K.1, B1.2, B2.2  FAIL
K.1, B1.2, B2.3  FAIL
K.2, B1.2, B2.2  FAIL
K.2, B1.2, B2.3  FAIL
K.3, B1.2, B2.2  PASS
K.3, B1.2, B2.3  PASS
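
As a cross-check that the reduced table really covers all 18 cases exactly
once, here is a minimal Python sketch that expands the "*" wildcards. The rows
are copied from the table above; the helper name `expand` and the tuple layout
are assumptions made for illustration.

```python
from itertools import product

KERNELS = ["K.1", "K.2", "K.3"]
FASTBOOT = ["B1.1", "B1.2"]
ASPM = ["B2.1", "B2.2", "B2.3"]

# (kernel, FastBoot, ASPM, result); "*" matches every value of that axis.
REDUCED = [
    ("*",   "B1.1", "*",    "PASS"),
    ("*",   "B1.2", "B2.1", "VARIABLE"),
    ("K.1", "B1.2", "B2.2", "FAIL"),
    ("K.1", "B1.2", "B2.3", "FAIL"),
    ("K.2", "B1.2", "B2.2", "FAIL"),
    ("K.2", "B1.2", "B2.3", "FAIL"),
    ("K.3", "B1.2", "B2.2", "PASS"),
    ("K.3", "B1.2", "B2.3", "PASS"),
]


def expand(rows):
    """Expand wildcard rows into a dict keyed by (kernel, fastboot, aspm)."""
    full = {}
    for kernel, fastboot, aspm, result in rows:
        kernels = KERNELS if kernel == "*" else [kernel]
        fastboots = FASTBOOT if fastboot == "*" else [fastboot]
        aspms = ASPM if aspm == "*" else [aspm]
        for case in product(kernels, fastboots, aspms):
            if case in full:
                raise ValueError(f"two rows cover {case}")
            full[case] = result
    return full


matrix = expand(REDUCED)

# Hard failures only: the VARIABLE (Force L0s) cases are excluded here.
failing = sorted(case for case, result in matrix.items() if result == "FAIL")
```

`failing` ends up holding only K.1 and K.2 with FastBoot on and ASPM set to
BIOS or Disabled, which is the pattern that points at the ASPM rework commit
making things worse.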

Here's the DMI info:
Motherboard: X9DRH-7TF/7F/iTF/iF
Version: 3.0b
Release Date: 04/28/2014

Recall also I said I had two LSI cards in here?
SAS2008 (in a slot) and SAS2208 (onboard)

Regardless of the BIOS settings, the SAS2008 card continues to work, even when
its I/O region 0 is marked as disabled. So is there some other initialization
work needed on the SAS2208 card so that it works in all cases?

The case of FastBoot=on, ASPM=ForceL0s is the interesting one, and the
lspci outputs compare nicely. The only trimming to the diff below is to remove
the context of other devices (no changes).

This does also look functionally identical between 3c076351c402 and 69166fbf02c7.

Full lspci & dmesg for the working+broken 3.16-rc4 boots are attached.

-lspci.1405201451.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, working
+lspci.1405201693.ASPM=L0s.FastBoot.no.kparams = 3.16-rc4, broken
# diff -Nar   lspci.1405201451.ASPM=L0s.FastBoot.no.kparams lspci.1405201693.ASPM=L0s.FastBoot.no.kparams -I '^[0-9a-f][0-9a-f]:' -F rev   -U15
--- lspci.1405201451.ASPM=L0s.FastBoot.no.kparams	2014-07-12 21:44:11.243897367 +0000
+++ lspci.1405201693.ASPM=L0s.FastBoot.no.kparams	2014-07-12 21:48:13.866860888 +0000
@@ -1157,95 +1157,93 @@ 00:1f.6 Signal processing controller [11
(trim other device, no changes)
 01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
 	Subsystem: Super Micro Computer Inc LSI MegaRAID ROMB [15d9:0690]
-	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
-	Latency: 0, Cache Line Size: 64 bytes
 	Interrupt: pin A routed to IRQ 16
-	Region 0: I/O ports at 8000 [size=256]
+	Region 0: I/O ports at 8000 [disabled] [size=256]
 	Region 1: Memory at dfe60000 (64-bit, non-prefetchable) [size=16K]
 	Region 3: Memory at dfe00000 (64-bit, non-prefetchable) [size=256K]
 	Expansion ROM at dfe40000 [disabled] [size=128K]
 	Capabilities: [50] Power Management version 3
 		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
 		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
 	Capabilities: [68] Express (v2) Endpoint, MSI 00
 		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
 			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
 		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
 			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
 			MaxPayload 256 bytes, MaxReadReq 512 bytes
 		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
 		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
 			ClockPM- Surprise- LLActRep- BwNot-
 		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
 		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
 		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
 		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
 			 Compliance De-emphasis: -6dB
 		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
 			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
 	Capabilities: [d0] Vital Product Data
-		Unknown small resource type 00, will not decode more.
+		Not readable
 	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
 		Address: 0000000000000000  Data: 0000
-	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
+	Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
 		Vector table: BAR=1 offset=00002000
 		PBA: BAR=1 offset=00003000
 	Capabilities: [100 v2] Advanced Error Reporting
 		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
 		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
 	Capabilities: [1e0 v1] #19
 	Capabilities: [1c0 v1] Power Budgeting <?>
 	Capabilities: [190 v1] #16
 	Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
 		ARICap:	MFVC- ACS-, Next Function: 0
 		ARICtl:	MFVC- ACS-, Function Group: 0
-	Kernel driver in use: megaraid_sas
-00: 00 10 5b 00 07 04 10 00 05 00 04 01 10 00 00 00
+00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00
 10: 01 80 00 00 04 00 e6 df 00 00 00 00 04 00 e0 df
 20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 90 06
 30: 00 00 e4 df 50 00 00 00 00 00 00 00 0b 01 00 00
 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
 60: 00 00 00 00 00 01 00 00 10 d0 02 00 25 80 00 10
 70: 20 28 00 00 83 04 40 00 40 00 83 10 00 00 00 00
 80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
 90: 00 00 00 00 0e 00 00 00 03 00 3e 00 00 00 00 00
 a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00
-d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
+c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00
+d0: 03 a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 
(trim other device, no changes)
@@ -3049,35 +3047,35 @@ 80:05.4 PIC [0800]: Intel Corporation Xe
(trim other device, no changes)
 82:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
 	Subsystem: Dell 6Gbps SAS HBA Adapter [1028:1f1c]
-	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
+	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 	Latency: 0, Cache Line Size: 64 bytes
 	Interrupt: pin A routed to IRQ 56
-	Region 0: I/O ports at f000 [size=256]
+	Region 0: I/O ports at f000 [disabled] [size=256]
 	Region 1: Memory at fbe40000 (64-bit, non-prefetchable) [size=64K]
 	Region 3: Memory at fbe00000 (64-bit, non-prefetchable) [size=256K]
 	Expansion ROM at fbd00000 [disabled] [size=1M]
 	Capabilities: [50] Power Management version 3
 		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
 		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
 	Capabilities: [68] Express (v2) Endpoint, MSI 00
 		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
 			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
 		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
 			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
 			MaxPayload 256 bytes, MaxReadReq 512 bytes
 		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
 		LnkCap:	Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
 			ClockPM- Surprise- LLActRep- BwNot-
 		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
 		LnkSta:	Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
 		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
 		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
 			 Compliance De-emphasis: -6dB
 		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
 			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
 	Capabilities: [d0] Vital Product Data
 		Unknown small resource type 00, will not decode more.
 	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
 		Address: 0000000000000000  Data: 0000
 	Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
 		Vector table: BAR=1 offset=0000e000
 		PBA: BAR=1 offset=0000f800
 	Capabilities: [100 v1] Advanced Error Reporting
 		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
 		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
 	Capabilities: [138 v1] Power Budgeting <?>
 	Kernel driver in use: mpt2sas
-00: 00 10 72 00 07 04 10 00 03 00 07 01 10 00 00 00
+00: 00 10 72 00 06 04 10 00 03 00 07 01 10 00 00 00
 10: 01 f0 00 00 04 00 e4 fb 00 00 00 00 04 00 e0 fb
 20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 1c 1f
 30: 00 00 d0 fb 50 00 00 00 00 00 00 00 0b 01 00 00
 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 50: 01 68 03 06 08 00 00 00 00 00 00 00 00 00 00 00
 60: 00 00 00 00 00 82 00 00 10 d0 02 00 25 80 00 10
 70: 2f 28 09 00 82 04 00 00 40 00 82 10 00 00 00 00
 80: 00 00 00 00 00 00 00 00 00 00 00 00 16 00 00 00
 90: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
 a0: 00 00 00 00 00 00 00 00 05 c0 80 00 00 00 00 00
 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 c0: 11 00 0e 80 01 e0 00 00 01 f8 00 00 00 00 00 00
 d0: 03 a8 00 80 00 00 00 00 00 00 00 00 00 00 00 00
 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

(trim other device, no changes)
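
For anyone wanting to check the raw hex rows against the decoded lspci text,
here is a minimal Python sketch. It assumes the standard PCI layout (Command
register at config offset 0x04, MSI-X Message Control two bytes into the MSI-X
capability, which sits at 0xc0 on this device); the helper names are made up
for illustration, and the input rows are copied from the 01:00.0 diff above.

```python
def parse_row(line):
    """Parse one 'lspci -xxx' row like '00: 00 10 5b ...' into (offset, bytes)."""
    offset, _, data = line.partition(":")
    return int(offset, 16), bytes.fromhex(data.strip().replace(" ", ""))


def le16(row, index):
    """Read a little-endian 16-bit value from a 16-byte row."""
    return row[index] | (row[index + 1] << 8)


# Only the Command register bits that differ in the diff above.
COMMAND_BITS = {0: "I/O", 1: "Mem", 2: "BusMaster", 10: "DisINTx"}


def decode_command(value):
    """Map the PCI Command register to the flag names lspci prints."""
    return {name: bool((value >> bit) & 1) for bit, name in COMMAND_BITS.items()}


def decode_msix(control):
    """Decode MSI-X Message Control: enable, function mask, table size."""
    return {
        "enable": bool(control & 0x8000),
        "masked": bool(control & 0x4000),
        "count": (control & 0x07FF) + 1,
    }


# SAS2208 (01:00.0), rows 00 and c0 from the working and broken boots.
working_00 = parse_row("00: 00 10 5b 00 07 04 10 00 05 00 04 01 10 00 00 00")[1]
broken_00 = parse_row("00: 00 10 5b 00 02 00 10 00 05 00 04 01 10 00 00 00")[1]
working_c0 = parse_row("c0: 11 00 0f 80 01 20 00 00 01 30 00 00 00 00 00 00")[1]
broken_c0 = parse_row("c0: 11 00 0f 00 01 20 00 00 01 30 00 00 00 00 00 00")[1]

working_cmd = decode_command(le16(working_00, 4))  # Command = 0x0407
broken_cmd = decode_command(le16(broken_00, 4))    # Command = 0x0002
working_msix = decode_msix(le16(working_c0, 2))    # Message Control = 0x800f
broken_msix = decode_msix(le16(broken_c0, 2))      # Message Control = 0x000f
```

In the broken boot only Mem decoding is left enabled on the SAS2208, and MSI-X
is off, which matches the "Control:" and "MSI-X:" lines in the diff.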

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail     : robbat2@xxxxxxxxxx
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

Attachment: dmesg.1405201451.ASPM=L0s.FastBoot.no.kparams.WORKING.gz
Description: Binary data

Attachment: dmesg.1405201693.ASPM=L0s.FastBoot.no.kparams.BROKEN.gz
Description: Binary data

Attachment: lspci.1405201451.ASPM=L0s.FastBoot.no.kparams.WORKING.gz
Description: Binary data

Attachment: lspci.1405201693.ASPM=L0s.FastBoot.no.kparams.BROKEN.gz
Description: Binary data

