Re: linux-3.2: HW died, polling stopped.

Dear Sarah,
  thank you for your answer. My replies are inline below.

Sarah Sharp wrote:
> On Wed, Mar 21, 2012 at 11:14:52PM +0100, Martin Mokrejs wrote:
>> Dear Sarah and USB readers,
>>   I have problems with USB3.0 in 3.2.2, 3.2.5 and 3.2.11 kernels. However, most
>> of the "tests" I did in 3.2.11. For testing, I have a USB mouse connected via
>> Evolve Express Card (NEC chip uPD720200) giving me two *additional* USB3.0 ports
>> to my Dell Vostro 3550 laptop.
> 
> Ok, so you have a TI internal xHCI host controller and an external NEC
> xHCI host that's attached via Express Card, correct?

Yes.

The laptop has two USB3.0 sockets in one corner. The one on the side, next to the
OUT/MIC jacks, probably works fine: so far I have read a lot of data through it via
my second USB3.0-to-SATAII bridge (I have two of these bridges), though I am not sure
how much writing I did over it. At least I have seen no errors there yet.

In contrast, the USB3.0 socket next to the ethernet socket is causing me trouble.
Within about 2 hours of heavy writing I can crash the target filesystem (ext3, ext4),
at least on 3.2.11 and 3.2.12.

For completeness, there is a USB2.0 socket in the opposite corner and a combined
eSATA/USB2.0 socket in the third corner.

00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
        Subsystem: Dell Device 04b3
        Flags: bus master, medium devsel, latency 0, IRQ 16
        Memory at f7f08000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Capabilities: [98] PCI Advanced Features
        Kernel driver in use: ehci_hcd

Is the eSATA/SATA port handled by the SATA controller below, or by the Intel 6 Series/C200 controller above?

00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05) (prog-if 01 [AHCI 1.0])
        Subsystem: Dell Device 04b3
        Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 44
        I/O ports at f0b0 [size=8]
        I/O ports at f0a0 [size=4]
        I/O ports at f090 [size=8]
        I/O ports at f080 [size=4]
        I/O ports at f060 [size=32]
        Memory at f7f06000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [70] Power Management version 3
        Capabilities: [a8] SATA HBA v1.0
        Capabilities: [b0] PCI Advanced Features
        Kernel driver in use: ahci

> 
> Why do you need the Express Card, if you don't mind my asking?  Can't
> you use a USB hub on your internally connected USB ports?

It seemed better to me to read through one USB3.0 chip and write through
another, even though the traffic then has to go through the Express Card chipset
as well. I also thought that two USB3.0 ports would not be enough. Finally, I do
not have a USB3.0 hub (only a USB2.0 one).

> 
> In general, I haven't had very good luck with xHCI Express Cards.  They
> seem to work fine for a while, and then they seem to get flaky and start
> disconnecting all the time.  I think it's because I was carrying it with
> me where ever I went, and they're just not designed to take that abuse.
> So it's possible you have a flaky Express Card, but also...

I agree that I cannot plug a USB cable into the second port of the card
while the card is already in the slot, because doing so unplugs the card.
Old PCMCIA CardBus cards did not have this mechanical problem.

But at the moment I am glad to have the Express Card with the NEC chipset,
because it works fine (unlike the built-in Texas Instruments chipset).

> 
>> Suddenly, the mouse disappears from the system
>> time to time. I turned on some debugging in the kernel and if I managed
>> to ask for the dmesg output soon in time, it is related to this:
>>
>> xhci_hcd 0000:11:00.0: Poll event ring: 4295920576
>> xhci_hcd 0000:11:00.0: op reg status = 0xffffffff
>> xhci_hcd 0000:11:00.0: HW died, polling stopped.
> 
> ...Express Cards are rather easy to bump, especially when you have a
> mouse attached to the port.  If you bump the express card, it will
> electrically disconnect from the PCI express bus, and the registers will
> read as all "f"s (as you can see from the op reg status).  Then the xHCI
> host controller driver will signal to the USB core that all the USB
> devices under your host disconnected.  If you jiggle the card again, it
> may re-connect, and the xHCI driver will reload and re-enumerate the
> device.
> 
> Maybe try moving the mouse and keyboard to a different port?  Or just
> plug in a USB 2.0 hub into your internally-connected USB 3.0 port.

Umm, no, it is not a problem with mechanical disconnection. I think USB
or SCSI sometimes mixes up devices. I reported this issue because it
seemed an obvious error to report. It also seemed to me that sometimes, when
USB resets a device connected through a USB2.0 hub, it actually resets all
devices on that hub.
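
To check a saved dmesg log for the all-ones register read quoted above,
something like the following sketch could be used. It is only an illustration:
the function name find_dead_hosts and the regular expression are assumptions
based on the three log lines quoted above, not on the xhci_hcd source.

```python
import re

# Matches lines such as "xhci_hcd 0000:11:00.0: op reg status = 0xffffffff".
# The pattern is an assumption derived from the log excerpt quoted above.
STATUS_RE = re.compile(
    r"xhci_hcd (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"op reg status = (?P<val>0x[0-9a-f]+)"
)


def find_dead_hosts(lines):
    """Return PCI addresses whose status register read back as all ones."""
    dead = []
    for line in lines:
        m = STATUS_RE.search(line)
        if m is None:
            continue
        # 0xffffffff is what a read returns when the device no longer responds.
        if int(m.group("val"), 16) == 0xFFFFFFFF and m.group("dev") not in dead:
            dead.append(m.group("dev"))
    return dead


sample = [
    "xhci_hcd 0000:11:00.0: Poll event ring: 4295920576",
    "xhci_hcd 0000:11:00.0: op reg status = 0xffffffff",
    "xhci_hcd 0000:11:00.0: HW died, polling stopped.",
]
print(find_dead_hosts(sample))
```

An all-ones read only says that the controller stopped answering at that
moment. It does not by itself tell a mechanical disconnect apart from any
other cause.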

> 
>>   I have attached a file again-stopped-xhci-on-3.2.11.txt where you can find this
>> at about line 3332. Into the same file I smahed then lspci, .config and lsusb
>> after the error occurred.
>>
>>   Funny is that lspci once reported:
>>
>> 11:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev ff) (prog-if ff)
>>         !!! Unknown header type 7f
>>         Kernel driver in use: xhci_hcd
>>
>>   I then went to unplug the Express Card and re-plug again, so you will see this in
>> the logs in that file as well.
>>
>>   I am not sure whether I managed to get the issue reproduce without other USB devices
>> attached. I think I always had to have plugged into eSATA/USB2.0 my external keyboard
>> at least.
>>   However, in some tests you will see xhci_hcd 0000:0b:00.0 where is an external USB3.0
>> controller with my external hard disk (but this is connected to the internal chipset
>> inside the laptop, not via the Express card where the mouse is, for testing). The disk
>> controller makes the sleep sleep after some time.
> 
> What do you mean by "The disk controller makes the sleep sleep after
> some time"?

The chipset (JMicron) in the black plastic enclosure decides after a while (maybe 10 minutes?)
to power down the attached disk, regardless of any kernel USB suspend settings.

> 
>> I was not using the disk intentionally,
>> just kept it plugged in during some of my tests.
>>
>>   I haven't seen anything logged in /var/log/messages, only dmesg was giving output
>> when the debug in USB and PCI was turned on.
>>
>>
>> ***********
>>   I have bits of logs of other attempts. Another replication is in stopped-xhci-on-3.2.11.txt
>> file. There is again the "xhci_hcd 0000:11:00.0: HW died, polling stopped." message.
>>
>> ***********
>>   I think merely a pure bootup of the system is logged in new.dmesg2.txt file, just to give
>> you an idea what hardware is this about. new.dmesg.txt has about the same value.
>>
>> ***********
>>   The file xhci-died-3.2.11.txt does not contain the "HW died" message but maybe I just
>> did not have enabled the verbose logging yet? Based on the timestamp it was the very first
>> file with logs I wrote.
>>
>> My apologies for this rather messy email. I just do not know where to start. ;)
>> I have prepared the usbmon support in the kernel but maybe the verbose logging
>> is already enough? Or is this a PCI Express Hotplug issue?
>>
>> P.s.: Could it be related to the MMAPPED IO options set in my .config?
> 
> Probably not.

It is weird. I think it has to do with the kernel mixing up the USB devices, although they are
on different chipsets and ports. It could actually be a SCSI fault.
(To illustrate: recently, on another host, I unplugged the FireWire cable from a disk,
but the kernel reported that I had unplugged my USB disk, a completely different drive.)
I think I am facing something similar here, and that is why I reported this.

> 
>> 0b:00.0 USB controller: Texas Instruments Device 8241 (rev 02) (prog-if 30 [XHCI])
>> 	Subsystem: Dell Device 04b3
>> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Latency: 0, Cache Line Size: 64 bytes
>> 	Interrupt: pin A routed to IRQ 16
>> 	Region 0: Memory at f7d00000 (64-bit, non-prefetchable) [size=64K]
>> 	Region 2: Memory at f7d10000 (64-bit, non-prefetchable) [size=8K]
>> 	Capabilities: [40] Power Management version 3
>> 		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=100mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
>> 		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>> 	Capabilities: [48] MSI: Enable- Count=1/8 Maskable- 64bit+
>> 		Address: 0000000000000000  Data: 0000
>> 	Capabilities: [70] Express (v2) Endpoint, MSI 00
>> 		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>> 			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>> 		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>> 			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>> 			MaxPayload 128 bytes, MaxReadReq 512 bytes
>> 		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>> 		LnkCap:	Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
>> 			ClockPM+ Surprise- LLActRep- BwNot-
>> 		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
>> 			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>> 		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>> 		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
>> 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
>> 		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
>> 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>> 			 Compliance De-emphasis: -6dB
>> 		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>> 			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>> 	Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
>> 		Vector table: BAR=2 offset=00000000
>> 		PBA: BAR=2 offset=00001000
>> 	Capabilities: [100 v2] Advanced Error Reporting
>> 		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> 		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> 		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> 		CESta:	RxErr- BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr+
>> 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>> 		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>> 	Capabilities: [150 v1] Device Serial Number 08-00-28-00-00-20-00-00
>> 	Kernel driver in use: xhci_hcd
> 
> So you have an internal TI xHCI host controller as well?  How does that
> work for you?

I will keep that under the thread "Re: xhci_hcd 0000:0b:00.0: WARN: transfer error on endpoint",
but the USB3.0 port next to the ethernet socket is somehow bad (probably unlike the USB3.0 port next
to the MIC/OUT jacks of the sound card).

> 
>> 11:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev ff) (prog-if ff)
>> 	!!! Unknown header type 7f
--------^^^^^^^^^^^^^^^^^^^^^^^^^^
Why is this?
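
A possible explanation, offered as a sketch rather than a confirmed diagnosis:
if the device has stopped responding on the PCI Express bus, every
configuration-space read returns all ones, the same pattern as the 0xffffffff
status above. The header-type byte then reads 0xff, and with the multi-function
bit (bit 7) masked off that leaves 0x7f, which is not a defined header layout.
The revision and prog-if bytes would read ff for the same reason. The offsets
below are the standard PCI configuration header offsets; the function name
decode_header is made up for illustration.

```python
# Standard PCI configuration header offsets (in bytes).
PCI_REVISION_ID = 0x08
PCI_CLASS_PROG = 0x09
PCI_HEADER_TYPE = 0x0E


def decode_header(config: bytes) -> dict:
    """Decode the few header fields that appear in the lspci line above."""
    header_type = config[PCI_HEADER_TYPE] & 0x7F  # bit 7 is the multi-function flag
    return {
        "rev": config[PCI_REVISION_ID],
        "prog_if": config[PCI_CLASS_PROG],
        "header_type": header_type,
        # Defined layouts: 0 = normal device, 1 = PCI bridge, 2 = CardBus bridge.
        "known_layout": header_type in (0, 1, 2),
    }


# A device that no longer responds: every byte reads back as 0xff.
dead = decode_header(bytes([0xFF] * 64))
print("rev %02x, prog-if %02x, header type %02x"
      % (dead["rev"], dead["prog_if"], dead["header_type"]))
```
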


>> 	Kernel driver in use: xhci_hcd
>>
>>            CPU0       CPU1       CPU2       CPU3       
>>   0:         58          0          0          0   IO-APIC-edge      timer
>>   1:          9          0          0          0   IO-APIC-edge      i8042
>>   8:         96          0          0          0   IO-APIC-edge      rtc0
>>   9:          6          0          0          0   IO-APIC-fasteoi   acpi
>>  12:       6149          0          0          0   IO-APIC-edge      i8042
>>  16:       7976          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
>>  23:        180          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
>>  40:          0          0          0          0  DMAR_MSI-edge      dmar0
>>  41:          0          0          0          0  DMAR_MSI-edge      dmar1
>>  42:          4          0          0          0   PCI-MSI-edge      pciehp
>>  43:     222595          0          0          0   PCI-MSI-edge      i915
>>  44:      40867          0          0          0   PCI-MSI-edge      ahci
>>  45:          0          0          0          0   PCI-MSI-edge      eth0
>>  46:      43409          0          0          0   PCI-MSI-edge      xhci_hcd
>>  47:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  48:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  49:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  50:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  51:     101576          0          0          0   PCI-MSI-edge      xhci_hcd
>>  52:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  53:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  54:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  55:          0          0          0          0   PCI-MSI-edge      xhci_hcd
>>  56:         14          0          0          0   PCI-MSI-edge      mei
>>  57:        270          0          0          0   PCI-MSI-edge      snd_hda_intel
>>  58:     374411          0          0          0   PCI-MSI-edge      iwlwifi
>> NMI:          0          0          0          0   Non-maskable interrupts
>> LOC:     551563     358343     434805     521638   Local timer interrupts
>> SPU:          0          0          0          0   Spurious interrupts
>> PMI:          0          0          0          0   Performance monitoring interrupts
>> IWI:          0          0          0          0   IRQ work interrupts
>> RES:    1209730    1192147    1096504    1242733   Rescheduling interrupts
>> CAL:        121        173        166         83   Function call interrupts
>> TLB:       2803       4764       4722       6607   TLB shootdowns
>> TRM:          0          0          0          0   Thermal event interrupts
>> THR:          0          0          0          0   Threshold APIC interrupts
>> MCE:          0          0          0          0   Machine check exceptions
>> MCP:         34         34         34         34   Machine check polls
>> ERR:          0
>> MIS:          0
> 
> Sarah Sharp

Martin