I have a two-node cluster. One node (node A) runs Linux kernel 2.6.11.12,
while the other (node B) runs 2.6.18. Both are running cman_tool
version 5.0.1. I start node A first, then node B joins. Node A can
mount the GFS file systems, but when node B tries to mount them, it gets a
kernel oops, which is pasted at the end of this email (see "KERNEL OOPS
output"). When I reboot node B and try to rejoin, it does not seem able to
communicate with node A correctly, as if the cluster were stuck in some
stale state (see "node B rejoin kernel messages"). Looking at node A, it
seems to have received the join message, but it does not appear to have
sent an ack, and then node A simply leaves the cluster (see "node A
kernel messages").
I suspect the problem is that the two nodes run different versions of the
cluster software (even though --version reports the same version on both),
but the newest -rSTABLE no longer compiles against 2.6.11.12. What is the
recommended solution for a cluster that must run different kernel versions?
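
In case it helps with reading the oops below: the reported fault address
looks consistent with the register dump. The marked instruction bytes in
the "Code:" line (8b 40 10) decode as a 32-bit load from 0x10 past eax, and
eax is 0000000c. Here is a minimal Python sketch of that arithmetic. The
function names are my own, and it only decodes this one instruction
encoding, so treat it as a rough sanity check rather than a disassembler.

```python
# Values copied by hand from the oops output.
EAX = 0x0000000C                          # "eax: 0000000c"
FAULT_BYTES = bytes([0x8B, 0x40, 0x10])   # "<8b> 40 10" in the Code: line
REPORTED_FAULT_ADDR = 0x0000001C          # "virtual address 0000001c"


def decode_mov_disp8_eax(insn: bytes) -> int:
    """Return the signed 8-bit displacement of `mov disp8(%eax), %eax`.

    Handles only this one encoding: opcode 0x8b (MOV r32, r/m32) with
    ModRM byte 0x40 (mod=01, reg=eax, rm=eax), followed by a disp8.
    """
    if len(insn) != 3 or insn[0] != 0x8B or insn[1] != 0x40:
        raise ValueError("not the expected mov disp8(%eax),%eax encoding")
    disp = insn[2]
    # disp8 is sign-extended by the CPU.
    return disp - 0x100 if disp >= 0x80 else disp


def fault_address(base: int, insn: bytes) -> int:
    """Effective address the instruction would read, wrapped to 32 bits."""
    return (base + decode_mov_disp8_eax(insn)) & 0xFFFFFFFF


if __name__ == "__main__":
    addr = fault_address(EAX, FAULT_BYTES)
    print(f"computed {addr:08x}, reported {REPORTED_FAULT_ADDR:08x}")
```

So the value in eax is a small offset rather than a valid pointer at the
point of the fault. This says nothing by itself about which structure that
value came from.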
tia,
dan
---
<KERNEL OOPS output>
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c
printing eip:
c01825e6
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx
firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod
scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core
serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror
dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core
usbcore tg3 thermal processor fan unix
CPU: 2
EIP: 0060:[<c01825e6>] Tainted: GF VLI
EFLAGS: 00010293 (2.6.18 #1)
EIP is at do_add_mount+0x66/0x130
eax: 0000000c ebx: f3843f24 ecx: c24fbac0 edx: f443f550
esi: df907200 edi: 00000000 ebp: 00000000 esp: f3843df4
ds: 007b es: 007b ss: 0068
Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000)
Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d df907200
       f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe 00000000
       c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 df98330c
Call Trace:
[<c018321d>] do_mount+0x33d/0x760
[<c0175080>] link_path_walk+0x80/0x100
[<c01507e3>] __handle_mm_fault+0x233/0x980
[<c0150a86>] __handle_mm_fault+0x4d6/0x980
[<c0147cdf>] __alloc_pages+0x4f/0x2f0
[<c0147fad>] __get_free_pages+0x2d/0x40
[<c0181ed7>] copy_mount_options+0x47/0x130
[<c01836dd>] sys_mount+0x9d/0xe0
[<c01031fb>] syscall_call+0x7/0xb
Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00
00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b>
40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44
EIP: [<c01825e6>] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4
<node B rejoin kernel messages>
CMAN: Waiting to join or form a Linux-cluster
CMAN: sending membership request (message repeated 30 times)
CMAN: Been in JOINWAIT for too long - giving up
CMAN: sendmsg failed: -22
<node A kernel messages>
CMAN: node blade14 rejoining
CMAN: too many transition restarts - will die
CMAN: we are leaving the cluster. Inconsistent cluster view
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster