RAID1 disk failure causes hung mount

Hello,

We're running a modified version of the FC4 2.6.17 kernel (2.6.17.4). I realize this is an old kernel. For internal reasons, we cannot update to a newer version of the kernel at this time.

We have a 3ware 9550SXU card with 12 drives in JBOD mode. These 12 drives are mirrored in 6 RAID1 pairs, then striped together in one big RAID0 stripe. When we have a disk error with one of the drives in a RAID1 pair, the entire RAID0 mount locks up. We can still cd to the mount and read from it, but if we try to write anything to the mount, the process hangs in an unkillable state.
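For reference, the layout was assembled roughly along these lines (the device names and md numbers below are illustrative, not our exact configuration):

```shell
# Sketch only: 6 RAID1 pairs built from 12 JBOD disks, then one RAID0
# stripe across the pairs. Device names here are assumptions.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdg1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdh1
# ... four more pairs (md2 through md5) ...
mdadm --create /dev/md6 --level=0 --raid-devices=6 \
      /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
```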

This happened again recently. Here are the log messages from the disk failure:

sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
    Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 10964975
raid1: sdj1: rescheduling sector 10964912
3w-9xxx: scsi0: ERROR: (0x03:0x0202): Drive ECC error:port=9.
sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
    Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 10964975
raid1: sdd1: redirecting sector 10964912 to another mirror

When this happened, /dev/sdj1 did not fail out of its RAID. It also did not lock the system. Later:

sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
    Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 16744439
raid1: sdj1: rescheduling sector 16744376

When this happened, /dev/sdj1 did not fail out of its RAID, but it did lock writes to the big RAID0 stripe. I manually failed /dev/sdj1 out of the RAID, and /proc/mdstat did report it as failed at that point, but failing it did not unblock the pending writes. I then tried to manually remove /dev/sdj1 from the RAID, and mdadm reported that the device was busy. A hard power-cycle was required to restore the system, which has been the pattern every time this class of error occurs.
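For clarity, these are the manual recovery steps I attempted (using /dev/mdX as a stand-in for the affected RAID1 pair; exact device names elided):

```shell
# Mark the bad member faulty in the RAID1 pair.
mdadm --manage /dev/mdX --fail /dev/sdj1
# /proc/mdstat then showed sdj1 as failed (F), but writes stayed hung.
cat /proc/mdstat
# Attempt to remove the failed member -- this is the step that
# reported "device busy".
mdadm --manage /dev/mdX --remove /dev/sdj1
```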

I have looked through the patches related to RAID1 lockups. The January and March 2008 patches addressing RAID1 deadlocks have not seemed to help (I didn't really expect them to, since 2.6.17.4 predates the bitmap code, no?).

I'd like to gather more debugging information during incidents like this, but I'm not sure what to collect or how to collect it. If anyone has any suggestions, I'd appreciate hearing them. If any of the md developers can point me at patches that specifically address this behavior, I'd be very grateful for that advice. As it is, I've been combing through git commits with very little success.
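In the meantime, here is what I was planning to capture the next time the hang occurs, assuming magic SysRq is available on this kernel; if there are better things to gather, please say so:

```shell
# Enable magic SysRq if it isn't already.
echo 1 > /proc/sys/kernel/sysrq
# Dump the state and kernel stacks of all tasks (should show where the
# hung, unkillable writers are blocked).
echo t > /proc/sysrq-trigger
# Record the array state at the time of the hang.
cat /proc/mdstat
# Save the resulting traces before they scroll out of the ring buffer.
dmesg > /tmp/hang-dmesg.txt
```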

Thanks in advance for any assistance, knowledge, or suggestions anyone has.

Philip
