Re: [sata_sil] kernel 2.6.17(-mm2) test - timeout issue

Tejun Heo <htejun@xxxxxxxxx> · Mon, 07 Aug 2006 00:51:32 +0900

Hello,

Martin Ammermüller wrote:
I tried the patch, but i couldn't see any changes in kerneloutput. I
also noticed, that there are actually two slightly different
error-messages.

#1 (shorter one, without HSM violation):
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)

DRDY (device ready), DMA engine active but no DRQ (data request), while 
READ DMA - seems like a packet loss/corruption during data transfer to 
me, but set DRDY is a bit weird.

ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)

Again, weird.  libata times out waiting for DRDY, which is weird because 
DRDY was set when the timeout occurred (as reported above) but when EH 
reset code is executed (which should have followed immediately), the 
code sees !DRDY and times out waiting for it.

ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/100
ata1: EH complete

SRST successfully recovers the device in this case.

#2 (longer, with HSM violation):
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2 frozen

SErr is reporting handshake error (R_ERR seen) diagnostic bit but not 
reporting any error bit.

ata1.00: (BMDMA stat 0x20)
ata1.00: tag 0 cmd 0xc8 Emask 0x2 stat 0x58 err 0x0 (HSM violation)

DMA engine off and DRDY && DRQ.  Again, looks like data transmission 
error but considering data transmission direction is from device to host 
(READ_DMA), the error status is confusing.

ata1: soft resetting port
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)

prereset saw set DRDY this time but after SRST, BSY is stuck at 1.

ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0xD8 on port 0xDCA44087
ATA: abnormal status 0xD8 on port 0xDCA44087
ATA: abnormal status 0xD8 on port 0xDCA44087
ATA: abnormal status 0xD8 on port 0xDCA44087
ATA: abnormal status 0xD8 on port 0xDCA44087
ata1.00: qc timeout (cmd 0xec)

We should really fail softreset if ata_busy_sleep() fails in 
ata_bus_post_reset().  In this case, softreset reports success after 
timeout causing the following revalidation to timeout too.

ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/100
ata1: EH complete

hardreset successfully recovers the device.

Hmmm.. with the patch, sata_sil should have tried hardreset at the first 
time in the second case.  There's our third weirdity.  I think the 
problem can be worked around by...

1. having shorter timeout value on READ/WRITE commands.  30s is *way* 
too long.

2. making reset procedure more intelligent.  There's no reason to wait 
full 30s before and after softreset if it's not for hotplug.  It should 
switch to hardreset if the device doesn't respond in several secs. 
Being responsive && giving device enough time eventually shouldn't be 
too difficult.

#1 shouldn't be difficult but we need to be careful.  #2 might take some 
time to implement.

I'm not sure why the previous patch didn't kick in.  The condition 
should have been caught and EH_HARDRESET requested.  Can you please 
double check the patched kernel is running?  You can put a little 
printk() after the freeze: label in sil_interrupt() to be sure.  That's 
the only place where sata_sil freezes the port except for timing out.

Thanks.

--
tejun
-
: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html