On 7/7/05, Farkas Levente <lfarkas@xxxxxxxxx> wrote:
> hi,
> after we switch our servers from centos-3 to centos-4 (aka. rhel-4) one
> of our server always crash once a week without any oops. this happens
> with both the normal kernel-2.6.9-11.EL and
> kernel-2.6.9-11.106.unsupported. after we change the motherboard, the
> raid controller and the cables too we still got it. finally we start
> netdump and last but not least yesterday we got a crash log and a core
> file. it seems there is a bug in the raid5 code of the kernel.
> this is our backup server with 8 x 200GB hdd in a raid5 (for the data)
> plus 2 x 40GB hdd in raid1 (for the system) with 3ware 8xxx raid
> controller, running. i attached the netdump log of the last crash.
> how can i fix it?
> yours.
>

Hi,

I have seen similar (but not quite the same) crashes in the raid code on
RHEL 3 kernels. They have typically occurred due to a race condition
between something updating the linked lists of raid devices and something
trying to read them. For RHEL 3, my co-workers and I found where one
particular race condition had been fixed in the 2.6 kernel and backported
that fix to the RHEL 3 kernel. Ultimately this patch was placed in one of
the updates for the RHEL 3 kernel.

Anyway, it is likely your problem is yet another race condition. What I
would suggest is to get a box configured with true RHEL 4 and reproduce
the crash there. Once it is reproduced, file a Bugzilla report with
Red Hat. We have had very good success with this approach for a number of
kernel bugs we found in the CentOS 3/RHEL 3 kernels. Fixes have not always
come quickly, but they generally do come.

Good Luck...james