Dan B. Phung wrote:
> I have a two node cluster, one node (node A) runs linux kernel 2.6.11.12
> while the other (node B) runs 2.6.18. both are running cman_tool
> version 5.0.1. I first start up node A, then node B joins. node A can
> mount the GFS file systems, but when node B tries that, it gets a kernel
> oops, which is pasted at the end of the email (see "KERNEL OOPS output").
> So I reboot node B and try to rejoin, but it seems to not be able to
> communicate with node A correctly, as if the cluster is in some stale
> state (see "node B rejoin kernel messages"). Upon viewing node A, it
> seemed to have received the join message, but it looks like it didn't
> send an ack or something, and then node A simply quits...(see "node A
> kernel messages").
>
> I think the problem lies in my use of two different cluster software
> versions (even though --version doesn't say so), but the newest -rSTABLE
> doesn't compile with 2.6.11.12 anymore. What is the recommended
> solution for a cluster that must run different kernel versions?
>
> tia,
> dan
>
> ---
>
> <KERNEL OOPS output>
>
> BUG: unable to handle kernel NULL pointer dereference at virtual
> address 0000001c
> printing eip:
> c01825e6
> *pde = 00000000
> Oops: 0000 [#1]
> PREEMPT SMP
> Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx
> firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod
> scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core
> serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror
> dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core
> usbcore tg3 thermal processor fan unix
> CPU: 2
> EIP: 0060:[<c01825e6>] Tainted: GF VLI
> EFLAGS: 00010293 (2.6.18 #1)
> EIP is at do_add_mount+0x66/0x130
> eax: 0000000c ebx: f3843f24 ecx: c24fbac0 edx: f443f550
> esi: df907200 edi: 00000000 ebp: 00000000 esp: f3843df4
> ds: 007b es: 007b ss: 0068
> Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000)
> Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d
> df907200
> f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe
> 00000000
> c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0
> df98330c
> Call Trace:
> [<c018321d>] do_mount+0x33d/0x760
> [<c0175080>] link_path_walk+0x80/0x100
> [<c01507e3>] __handle_mm_fault+0x233/0x980
> [<c0150a86>] __handle_mm_fault+0x4d6/0x980
> [<c0147cdf>] __alloc_pages+0x4f/0x2f0
> [<c0147fad>] __get_free_pages+0x2d/0x40
> [<c0181ed7>] copy_mount_options+0x47/0x130
> [<c01836dd>] sys_mount+0x9d/0xe0
> [<c01031fb>] syscall_call+0x7/0xb
> Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00
> 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b>
> 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44
> EIP: [<c01825e6>] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4
>
> <node B rejoin kernel messages>
> CMAN: Waiting to join or form a Linux-cluster
> CMAN: sending membership request (message repeated 30 times)
> CMAN: Been in JOINWAIT for too long - giving up
> CMAN: sendmsg failed: -22
>
> <node A kernel messages>
> CMAN: node blade14 rejoining
> CMAN: too many transition restarts - will die
> CMAN: we are leaving the cluster. Inconsistent cluster view

That's a known bug. Upgrade the kernel component of cman.

--
patrick

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster