Hi Tejun,
Just as some further testing and poking, I added the drives to the
list of disks to disable NCQ for, but it didn't resolve the issue.
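In case it's useful, here's roughly how I double-check that NCQ is really
off at runtime (a queue depth of 1 means libata won't issue NCQ commands);
the sdX names below are just whatever the drives behind the PMP enumerate as:

for d in sdc sdd sde sdf sdg; do
  echo "$d queue_depth: $(cat /sys/block/$d/device/queue_depth)"
done
# If this sysfs file is writable on the running kernel, forcing the depth
# to 1 is another way to rule NCQ out without editing the blacklist:
echo 1 > /sys/block/sdc/device/queue_depth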
I increased the PMP timeout to 1000 rather than 250, and that didn't
resolve the problem either.
The interface still has timeout errors writing the ext3 fs.
Thanks,
Rusty
On Aug 20, 2007, at 1:56 PM, Rusty Conover wrote:
Hi Tejun,
I've taken your advice and reseated and re-cabled everything. I did
find one bad drive, which I've removed, but sadly I'm still having
problems.
I've done some more testing that may help you out.
I've tested all 5 WDC drives and they all work. The problem is that I
get this exception:
ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
ata6.00: cmd 60/80:00:3f:45:08/00:00:00:00:00/40 tag 0 cdb 0x0 data 65536 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
It only happens whenever I have any drive in any of the PMP ports.
I can have 4 drives in all native ports and they all work great.
I've tested all of the position/port/disk combinations, so I've
eliminated the drives as being part of the problem. I can swap any
combination into the 4 native SATA ports and things work great
(that is, I can set up a RAID10 and create an ext3 fs without any
resets).
It doesn't matter whether I place the drive in the first or second PMP
group; it still causes a timeout.
The Norco-1220 block diagram shows:
Bays
1-4 = Sil3726 #1 - PMP
5 = Sil3726 #1 - Native SATA port
6-9 = Sil3726 #2 - PMP
10 = Sil3726 #2 - Native SATA port
11 = Sil3124 - Native SATA port
12 = Sil3124 - Native SATA port
When I've got disks in ports 5, 10, 11, and 12 things work great, but
if any disks are in ports 1-4 or 6-9 I have timeout problems.
I've tried turning down the speed of the PCI-X board, but it
doesn't have any effect.
I've posted my kernel log at:
http://rusty.devel.infogears.com/silerrors.txt
The interesting thing is when I create the raid with:
echo -n 500000 > /proc/sys/dev/raid/speed_limit_max
mdadm --create /dev/md2 --chunk=128 --level=10 --layout n2 --raid-devices=5 /dev/sd{c,d,e,f,g}1
mkfs -t ext3 -b 4096 -m 0 -R stride=16 /dev/md2
It always fails around the same area, though not at the exact same
inode table number; the first error is triggered after just over 2000
inode tables have been written. Could this be the point where the I/O
stops hitting the write cache, so the disk times out because the disk
or the PMP port can't keep up with some built-in timer?
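One way to test that theory, assuming the drives enumerate as /dev/sdc
through /dev/sdg, would be to push a long sequential write straight at a
single disk behind the PMP with O_DIRECT, taking md, ext3 and the page
cache out of the picture:

# WARNING: destroys the data on sdc1. Writes 4 GB directly to the
# partition; if the PMP path can't sustain streaming writes, this should
# trip the same timeout on its own.
dd if=/dev/zero of=/dev/sdc1 bs=1M count=4096 oflag=direct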
I've posted the results of
hdparm -I and smartctl -a for all of the disks at:
http://rusty.devel.infogears.com/disk.info.txt
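Something like the following loop reproduces that dump (the device names
are assumed to match wherever the five WDC drives show up):

for d in /dev/sd{c,d,e,f,g}; do
  hdparm -I $d
  smartctl -a $d
done > disk.info.txt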
Thank you for your help,
Rusty