Re: JMicron - hard resetting link

To be honest, I didn't believe that changing anything about the PSU
would make a difference.
However, it seemingly did.
I have also updated the BIOS, but I guess that has little
to do with it.
So a second PSU of a different brand was installed, and this
one got the motherboard and the 4 disks which were failing.
The "old" PSU got the other 4 HDDs and the 2 system
HDDs.
The test was started yesterday (Feb 13) at about 16:30 CET, including
building the array and copying files. Today (Feb 14) at about 20:22 the
problem appeared again, but it seemingly "moved" with the PSU to the
other group of 4 disks (on the nvidia controller) - more precisely, to only
2 of them (the array is still operational).

Feb 14 20:22:32 storage1 kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Feb 14 20:22:32 storage1 kernel: ata10.00: cmd c8/00:00:c3:d5:3b/00:00:00:00:00/e2 tag 0 dma 131072 in
Feb 14 20:22:32 storage1 kernel: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 20:22:32 storage1 kernel: ata10.00: status: { DRDY }
Feb 14 20:22:32 storage1 kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Feb 14 20:22:32 storage1 kernel: ata9.00: cmd c8/00:00:c3:d5:3b/00:00:00:00:00/e2 tag 0 dma 131072 in
Feb 14 20:22:32 storage1 kernel: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 20:22:32 storage1 kernel: ata9.00: status: { DRDY }
Feb 14 20:22:33 storage1 kernel: ata10: soft resetting link
Feb 14 20:22:33 storage1 kernel: ata9: soft resetting link
Feb 14 20:22:33 storage1 kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 20:22:33 storage1 kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 20:23:03 storage1 kernel: ata9.00: qc timeout (cmd 0x27)
Feb 14 20:23:03 storage1 kernel: ata9.00: failed to read native max address (err_mask=0x4)
Feb 14 20:23:03 storage1 kernel: ata9.00: HPA support seems broken, will skip HPA handling
Feb 14 20:23:03 storage1 kernel: ata9.00: revalidation failed (errno=-5)
Feb 14 20:23:03 storage1 kernel: ata9: failed to recover some devices, retrying in 5 secs
Feb 14 20:23:03 storage1 kernel: ata10.00: qc timeout (cmd 0x27)
Feb 14 20:23:03 storage1 kernel: ata10.00: failed to read native max address (err_mask=0x4)
Feb 14 20:23:03 storage1 kernel: ata10.00: HPA support seems broken, will skip HPA handling
Feb 14 20:23:03 storage1 kernel: ata10.00: revalidation failed (errno=-5)
Feb 14 20:23:03 storage1 kernel: ata10: failed to recover some devices, retrying in 5 secs
Feb 14 20:23:08 storage1 kernel: ata9: hard resetting link
Feb 14 20:23:08 storage1 kernel: ata10: hard resetting link
...

Full kern.log is at:
http://www.huweb.hu/maques/tmp/jmicron/kern0214.log
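
To see at a glance which ports are affected in a log like this, the
libata messages can be counted per port. The following is only a rough
sketch in Python: the function name summarize and the event labels are
made up for illustration, and it assumes one syslog entry per line in the
"kernel: ataN[.MM]: message" format shown above.

```python
import re
from collections import Counter, defaultdict

# Port prefix as printed by libata, e.g. "ata10.00:" or "ata9:".
# The device suffix (".00") is dropped so events are grouped per port.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Message substring -> label. The labels are illustrative names only.
EVENTS = {
    "exception Emask": "exception",
    "qc timeout": "qc_timeout",
    "soft resetting link": "soft_reset",
    "hard resetting link": "hard_reset",
    "revalidation failed": "revalidation_failed",
}


def summarize(lines):
    """Return {port: Counter(label -> count)} for the given log lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = PORT_RE.search(line)
        if not match:
            continue  # not a libata port message (e.g. the "res ..." line)
        port, message = match.groups()
        for needle, label in EVENTS.items():
            if needle in message:
                counts[port][label] += 1
    return counts
```

Called on an open log file, e.g. summarize(open("kern0214.log")) for a
local copy of the log above, it returns one counter per port, which makes
it easy to check whether the errors really stay on the disks fed by one
PSU.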

So it seems that there is definitely something wrong with the "old" PSU.

Also, I tried to mount the failed drives, without success.

Thought I'd let you know.
Now I will try with only the "new" PSU to see what happens...

G.


----- Original Message -----
From: "Tejun Heo" <htejun@xxxxxxxxx>
To: "Gabor FUNK" <FUNK.Gabor@xxxxxxxxxxx>
Cc: "IDE/ATA development list" <linux-ide@xxxxxxxxxxxxxxx>
Sent: Wednesday, February 13, 2008 12:50 AM
Subject: Re: JMicron - hard resetting link


Hello,

Gabor FUNK wrote:
>> What I said was that timeouts occurring due to transmission errors
>> should be recoverable.  It seems like IRQ delivery didn't work, probably
>> due to a screaming IRQ.  I need to see the messages before the first
>> relevant error message.  It's always a good idea to post the full kernel
>> log from boot till failure.  Things which don't seem relevant are often
>> relevant.
>
> Naturally. Full kern.log with boot:
> http://www.huweb.hu/maques/tmp/jmicron/kern.log
> (no edits, there are really only those 2 lines between Feb 6 and Feb 9's
> 1st exception)

Hmmm... Indeed.  This is the first time this mode of failure is reported.

> Previously there was kernel 2.6.23.9 and I noticed the following in
> syslog back then:
> Feb  6 19:10:19 storage1 kernel: ata4: D2H reg with I during NCQ, this message won't be printed again
> Feb  6 19:10:20 storage1 kernel: ata1: D2H reg with I during NCQ, this message won't be printed again
> Feb  6 19:10:20 storage1 kernel: ata2: D2H reg with I during NCQ, this message won't be printed again
> Feb  6 19:10:21 storage1 kernel: ata3: D2H reg with I during NCQ, this message won't be printed again
>
> I googled and saw that there were some fixes related to this (maybe it
> was you), so that's why we hoped that 2.6.24 would fix this. Actually the
> above error messages were gone, but...

Yeap, those are gone.

>> Till now, none of this kind of problem has been tracked down to the MB or
>> the controller, while 90% of hardware problems turned out to be power
>> related.
>
> I'll put a brand new, probably different PSU in the case and put the MB
> and the 4 disks of the problematic controller on it, and put the 2 system
> and other 4 disks on this one (or even another one).

Yeap, please keep me posted.

> Meanwhile I'd welcome any suggestion as to why a controller reset is
> causing a "fatal error"...
> BTW, the drives were accessible after the array broke (when I got there).

What do you mean by 'drives were accessible'?  /dev/sdX nodes were
accessible?

--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

