Hi Tejun,
Just as some further testing and poking, I added the drives to the
list of disks to disable NCQ for, but it didn't resolve the issue.
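In case it's useful, here's roughly how I double-check that NCQ is really
off at runtime (a queue depth of 1 means libata won't issue NCQ commands);
the sdX names below are just whatever the drives behind the PMP enumerate as:

for d in sdc sdd sde sdf sdg; do
  echo "$d queue_depth: $(cat /sys/block/$d/device/queue_depth)"
done
# If this sysfs file is writable on the running kernel, forcing the depth
# to 1 is another way to rule NCQ out without editing the blacklist:
echo 1 > /sys/block/sdc/device/queue_depth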
I increased the PMP timeout to 1000 rather than 250, and that didn't
resolve the problem either.
The interface still has timeout errors writing the ext3 fs.
Thanks,
Rusty
On Aug 20, 2007, at 1:56 PM, Rusty Conover wrote:
Hi Tejun,
I've taken your advice and reseated and re-cabled everything. I did
find one bad drive, which I've removed, but sadly I'm still having
problems.
I've done some more testing that may help you out.
I've tested all 5 WDC drives and they all work. The problem is that I
get this exception:
ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
ata6.00: cmd 60/80:00:3f:45:08/00:00:00:00:00/40 tag 0 cdb 0x0 data 65536 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
It only happens whenever I have any drive in any of the PMP ports.
I can have 4 drives in all native ports and they all work great.
I've tested all of the position/port/disk combinations, so I've
eliminated the drives as being part of the problem. I can swap any
combination into the 4 native SATA ports and things work great
(that is, I can set up a RAID10 and create an ext3 fs without any
resets).
It doesn't matter whether I place the drive in the first or second PMP
group; it still causes a timeout.
The Norco-1220 block diagram shows:
Bays
1-4 = Sil3726 #1 - PMP
5 = Sil3726 #1 - Native SATA port
6-9 = Sil3726 #2 - PMP
10 = Sil3726 #2 - Native SATA port
11 = Sil3124 - Native SATA port
12 = Sil3124 - Native SATA port
When I've got disks in ports 5, 10, 11, and 12 things work great, but
if any disks are in ports 1-4 or 6-9 I have timeout problems.
I've tried turning down the speed of the PCI-X board, but it
doesn't have any effect.
I've posted my kernel log at:
http://rusty.devel.infogears.com/silerrors.txt
The interesting thing is when I create the raid with:
echo -n 500000 > /proc/sys/dev/raid/speed_limit_max
mdadm --create /dev/md2 --chunk=128 --level=10 --layout n2 --raid-devices=5 /dev/sd{c,d,e,f,g}1
mkfs -t ext3 -b 4096 -m 0 -R stride=16 /dev/md2
It always fails around the same area, though not at the exact same
inode table number; the first error is triggered after just over 2000
inode tables have been written. Could this be the point where the I/O
stops hitting the write cache, so the disk times out because the disk
or the PMP port can't keep up with some built-in timer?
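One way to test that theory, assuming the drives enumerate as /dev/sdc
through /dev/sdg, would be to push a long sequential write straight at a
single disk behind the PMP with O_DIRECT, taking md, ext3 and the page
cache out of the picture:

# WARNING: destroys the data on sdc1. Writes 4 GB directly to the
# partition; if the PMP path can't sustain streaming writes, this should
# trip the same timeout on its own.
dd if=/dev/zero of=/dev/sdc1 bs=1M count=4096 oflag=direct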
I've posted the results of
hdparm -I and smartctl -a for all of the disks at:
http://rusty.devel.infogears.com/disk.info.txt
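Something like the following loop reproduces that dump (the device names
are assumed to match wherever the five WDC drives show up):

for d in /dev/sd{c,d,e,f,g}; do
  hdparm -I $d
  smartctl -a $d
done > disk.info.txt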
Thank you for your help,
Rusty