mdadm freezes the system

Hello.

I am having a strange issue with md RAID on the 2.6.34 kernel. To be
specific, it sometimes locks up the system completely, with the following
symptoms:
- any attempt to read from an array never seems to return
- no errors at all on the server console
- in one lock-up episode I had "top" running, which displayed zero CPU
  load (no mdX_raidX in sight at the top of the CPU-sorted list)
- Alt-SysRq-B works, and allows rebooting the system
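
The symptoms above (reads that never return, zero CPU load) would be consistent with tasks stuck in uninterruptible sleep, shown as state "D". As a minimal sketch, assuming Python is available and the script is already running before the hang, like "top" was, the blocked tasks could be listed from /proc. The function names parse_state and blocked_tasks are made up for illustration:

```python
import glob


def parse_state(stat_line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

This only shows which tasks are blocked, not why. Reading /proc does not touch the disks, but starting a new process during the hang probably would, so this is only useful if it is already in memory.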

Now, regarding when this happens. I had two such lock-ups shortly after moving
my root FS to RAID5. After the first one I changed the FS from XFS to Ext4
(this did not help); after the second one I disabled NCQ on all drives and the
write-intent bitmap on the array. After that, it worked for maybe a week of
intense reads/writes to the arrays with no more hangs.

Today, I decided to convert a three-member RAID5 into a four-member
RAID6. mdadm segfaulted(!) right after the --grow command, and dmesg had
an error about md being unable to overwrite the /sys/.....stripe_cache_size
file. (As I understand it, this is already fixed in the latest kernel.)

The array then started rebuilding as 4-member RAID6 seemingly fine, but
shortly after, the system locked up in the same manner as described above.

Several attempts to do the rebuild after reboots consistently caused the same
lock-ups early in the rebuild (at less than 1% done). So for now, I decided to
give up and returned the array to its previous RAID5 three-member
configuration, which went fine.
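
To pin down how far the rebuild gets before a hang, the progress line in /proc/mdstat could be sampled. The following is a minimal sketch, not a tested tool: the function name parse_progress is hypothetical, and the SAMPLE text (device names, block counts, speeds) is made up to resemble typical /proc/mdstat output rather than copied from this system.

```python
import re

# Matches e.g. "reshape =  0.7%" on the progress line of an array.
PROGRESS_RE = re.compile(r"(reshape|recovery|resync)\s*=\s*([0-9.]+)%")

# Hypothetical sample, for illustration only.
SAMPLE = """\
md0 : active raid6 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      3886718976 blocks level 6, 64k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.7% (14000000/1943359488) finish=900.0min speed=35000K/sec
md1 : active raid5 sdc2[2] sdb2[1] sda2[0]
      20964608 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
"""


def parse_progress(mdstat_text):
    """Map array name -> (operation, percent) for arrays with a running operation."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            current = header.group(1)  # following lines belong to this array
            continue
        found = PROGRESS_RE.search(line)
        if found and current:
            progress[current] = (found.group(1), float(found.group(2)))
    return progress


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            print(parse_progress(f.read()))
    except OSError:
        print(parse_progress(SAMPLE))  # fall back to the made-up sample
```

Printing the result to a serial or network console at short intervals, rather than to a file on the arrays, would keep the last value visible after a lock-up.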

The configuration:
md0 is 3* 1990GB RAID5
md1 is 3* 10GB RAID5 (root FS)
The three drives are 2* WD20EADS and 1* Hitachi 2TB drive. The fourth array
member, which I was trying to add to md0, is a RAID0 of two 1TB drives (Seagate
and Hitachi). The SATA controllers are the nForce4 chipset and a PCI-E JMicron
JMB363. I am using mdadm 3.1.2 now, and am going to try the 2.6.35-rc2 kernel.

So, my question is: does anyone have an idea of what could cause this, and what
would be the best way to diagnose/fix the lockup problem? Thanks in advance.

-- 
With respect,
Roman
