Vanishing array/filesystem....

Hello all,

I have a Linux RAID 5 array (/dev/md0) consisting of 8 Western Digital
800BB (80 GB) drives. They are attached to two Promise PDC20268 (TX2, ATA/100,
non-RAID) PCI controllers and configured as 7+1 with no spare. The boot screen
and dmesg show the controllers have their own IRQs and appear as ide2/ide3 and
ide4/ide5, and the drives all show up as /dev/hde through /dev/hdl. Each drive
is manually jumpered as master or slave as appropriate (no cable select), and
both Promise cards have the latest BIOS applied.
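
For what it's worth, the basic checks below all come back clean after boot
(drive letters as listed above; the grep patterns are only approximate):

    # all eight members assembled and active
    cat /proc/mdstat

    # both Promise cards detected, each interface on its own IRQ
    dmesg | grep -i pdc
    cat /proc/interrupts

    # each drive identified and negotiated into a UDMA mode
    for d in hde hdf hdg hdh hdi hdj hdk hdl; do
        hdparm -i /dev/$d | grep -i udma
    done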

What happens is that after anywhere from 15 minutes to 24 hours the
filesystem/mount point stops responding. That is, /dev/md0 is ext3, mounted at
/export1, and anything touching /export1 hangs: "ls -l" never returns and
cannot be Ctrl-C'd. There is nothing in /var/adm/messages, no kernel panic,
and nothing on the console. The Samba (smbd) process exporting this filesystem
cannot be kill -9'd, even by root. Touching any of the drives with hdparm also
never returns and cannot be Ctrl-C'd. But /, /boot, and /export2 (non-RAID)
filesystems all continue to function normally, and /proc/mdstat shows all
drives up ("U"). The box otherwise continues to work as a firewall/NAT host,
filtering packets and hosting ssh sessions, and "top" shows nothing spinning.
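
Is there anything else worth capturing while it is wedged? Next time it hangs
I plan to check roughly the following (the SysRq part assumes
CONFIG_MAGIC_SYSRQ is compiled into these kernels, which I have not verified):

    # anything in D (uninterruptible sleep) is waiting on I/O that never completes
    ps -eo pid,stat,wchan,cmd | awk '$2 ~ /^D/'

    # enable the magic SysRq key; Alt-SysRq-T on the console then dumps the
    # kernel stack of every task, which should show where the I/O is stuck
    echo 1 > /proc/sys/kernel/sysrq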

I have tested this array on an Abit KT7 (VIA KT133 chipset, 3x256 MB PC133,
Athlon 1100) and an Abit KR7A (VIA KT266A chipset, 2x512 MB DDR266, Athlon XP
1900+), both with the latest BIOS and various memory timings (stock
non-interleaved, configured by SPD, and 4-way low-wait-state tweaks). Neither
system is overclocked. Both run Enermax 430 W power supplies (two different
models purchased a year apart). On both systems I have tried the 2.4.18,
2.4.19-rc3, and 2.4.19 kernels. I have shuffled/removed/replaced their network
cards (three different brands) and moved the controllers around to various
slots so that they were and weren't sharing IRQs with other devices. In both
cases /export1 becomes unresponsive after at most 24 hours. Copying large
amounts of data to the partition (either locally from another drive or
remotely via Samba) seems to make it fail sooner, but I cannot reliably
reproduce the problem, other than that it has never worked for more than a
day.
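
For reference, the local copy load that seems to bring it down fastest amounts
to something like the loop below (the file name and sizes are only
illustrative; I have not found a combination that triggers the hang on
demand):

    # write and re-read a couple of gigabytes in a loop on the suspect filesystem
    while true; do
        dd if=/dev/zero of=/export1/stress.tmp bs=1024k count=2048
        sync
        dd if=/export1/stress.tmp of=/dev/null bs=1024k
        rm -f /export1/stress.tmp
    done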

Five of the eight drives were pulled from a different host to build the array,
and three were purchased new. Individually they all pass badblocks. I ran both
systems overnight with memtest86 and no memory errors were found.
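
The per-drive check was a plain read-only surface scan along these lines
(repeated for each of hde through hdl; the exact flags here are only
indicative):

    # non-destructive read-only scan of one member drive
    badblocks -sv /dev/hde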

I am stumped. The array has enough data on it that I cannot easily
reconfigure it to try combinations of fewer drives, and every failure costs
about 3 hours of resync and fsck on the next boot. Since I have tried two
systems, I'm wondering whether anybody has had issues with the WD 800BB model
drives, or with the Promise controllers?

Should I just buy a 3ware 8-port controller?

Any suggestions are appreciated.

Thanks,

    Mike

PCI: No IRQ known for interrupt pin A of device 00:11.1. Please try using pci=biosirq.
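
If that message is relevant, one obvious thing to try is passing the suggested
option at boot. A minimal lilo.conf entry would look something like this (the
image name and root device below are placeholders, not my actual ones), with
/sbin/lilo re-run afterwards:

    image=/boot/vmlinuz-2.4.19
        label=linux
        root=/dev/hda1
        append="pci=biosirq"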
