Hi,

we have a similar setup, up to the NFS part: a 9-node GFS STABLE 1.02 cluster on kernel 2.6.16. The cluster is highly unstable whenever we have to reboot individual nodes or they fence each other.

Below is what the nodes complain about when they tried to fence node3. What stable kernel version do others here use?

The kernel is not compiled with any preemption code; as mentioned before on this list, preemption is untested with GFS, so we didn't bother.
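Since preemption keeps coming up here, a quick sanity check that a node's running kernel really was built without it (a minimal sketch; paths vary by distro, and /proc/config.gz requires CONFIG_IKCONFIG_PROC):

```shell
# Print the preemption-related options of the running kernel.
# Falls back from /proc/config.gz to the /boot config file.
if [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | grep 'CONFIG_PREEMPT'
elif [ -r "/boot/config-$(uname -r)" ]; then
    grep 'CONFIG_PREEMPT' "/boot/config-$(uname -r)"
else
    echo "no kernel config found; check the build tree's .config" >&2
fi
```

On a non-preemptive build you should see `# CONFIG_PREEMPT is not set`.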
CMAN: node node7 has been removed from the cluster : Inconsistent cluster view
CMAN: node node8 has been removed from the cluster : Inconsistent cluster view
CMAN: node node6 has been removed from the cluster : Inconsistent cluster view
CMAN: removing node node4 from the cluster : No response to messages
CMAN: node node2 has been removed from the cluster : Inconsistent cluster view
CMAN: node node9 has been removed from the cluster : Inconsistent cluster view
CMAN: removing node node1 from the cluster : No response to messages
------------[ cut here ]------------
kernel BUG at /var/tmp/portage/cman-kernel-1.02.00/work/cluster-1.02.00/cman-kernel/src/membership.c:3151!
invalid opcode: 0000 [#1]
SMP
Modules linked in: iptable_filter ipt_REDIRECT xt_tcpudp iptable_nat
ip_nat ip_conntrack ip_tables x_tables bcm5700 lock_dlm dlm cman gfs
lock_harness qla2300 qla2xxx_conf qla2xxx firmware_class
CPU: 0
EIP: 0060:[<f88de101>] Not tainted VLI
EFLAGS: 00010246 (2.6.16-gentoo-r1 #8)
EIP is at elect_master+0x2a/0x41 [cman]
eax: 00000080 ebx: 00000080 ecx: f888a000 edx: 00000000
esi: f88f1084 edi: f5cfdfcc ebp: f5cfdfb8 esp: f5cfdf70
ds: 007b es: 007b ss: 0068
Process cman_memb (pid: 7279, threadinfo=f5cfc000 task=f7c48580)
Stack: <0>f5dcb640 f88db725 f5cfdf8c 00000000 f88e91ac f5dcb640
f88d9896 f5ad0d40
00000000 f7c48580 f88d9c78 f58dc080 00000001 00000000 f5cfc000
0000001f
00000000 c0102b3e 00000000 f7c48580 c0118702 00100100 00200200
00000000
Call Trace:
[<f88db725>] a_node_just_died+0x172/0x1cf [cman]
[<f88d9896>] process_dead_nodes+0x74/0x80 [cman]
[<f88d9c78>] membership_kthread+0x3d6/0x40e [cman]
[<c0102b3e>] ret_from_fork+0x6/0x14
[<c0118702>] default_wake_function+0x0/0x12
[<f88d98a2>] membership_kthread+0x0/0x40e [cman]
[<c0101149>] kernel_thread_helper+0x5/0xb
Code: c3 53 b8 01 00 00 00 8b 1d 44 1e 8f f8 39 d8 7d 1a 8b 0d 48 1e
8f f8 8b 14 81 85 d2 74 06 83 7a 1c 02 74 13 83 c0 01 39 d8 7c ec <0f>
0b 4f 0c 60 6f 8e f8 31 c0 5b c3 8b 44 24 08 89 10 8b 42 14
Bas van der Vlies wrote:
We are using kernel 2.6.16 and CVS STABLE code 1.0.2. We have a 5-node
GFS cluster that exports the GFS filesystems over NFS to our compute cluster.
This is the error log; it crashed in gfs_glockd:
------------------------------------------
lisa_vg5_lv2 send einval to 3
lisa_vg5_lv1 send einval to 4
[last message repeated 31 more times]
lisa_vg5_lv1 unlock febd02eb no id
7367 pr_start cb jid 2 id 3
7367 pr_start 121 done 0
7428 recovery_done jid 2 msg 308 191b
7428 recovery_done nodeid 3 flg 1b
7428 recovery_done start_done 121
7348 pr_start last_stop 95 last_start 121 last_finish 95
7348 pr_start count 4 type 1 event 121 flags a1b
7348 pr_start cb jid 2 id 3
7348 pr_start 121 done 0
7330 pr_start last_stop 87 last_start 121 last_finish 87
7330 pr_start count 4 type 1 event 121 flags a1b
7330 pr_start cb jid 2 id 3
7330 pr_start 121 done 0
7409 recovery_done jid 2 msg 308 191b
7409 recovery_done nodeid 3 flg 1b
7409 recovery_done start_done 121
7390 recovery_done jid 2 msg 308 91b
7390 recovery_done nodeid 3 flg 1b
7390 recovery_done start_done 121
7310 pr_start last_stop 75 last_start 121 last_finish 75
7310 pr_start count 4 type 1 event 121 flags a1b
7310 pr_start cb jid 2 id 3
7310 pr_start 121 done 0
7371 recovery_done jid 2 msg 308 91b
7371 recovery_done nodeid 3 flg 1b
7371 recovery_done start_done 121
7290 pr_start last_stop 56 last_start 121 last_finish 56
7290 pr_start count 4 type 1 event 121 flags a1b
7290 pr_start cb jid 2 id 3
7290 pr_start 121 done 0
7352 recovery_done jid 2 msg 308 91b
7352 recovery_done nodeid 3 flg 1b
7352 recovery_done start_done 121
7271 pr_start last_stop 40 last_start 121 last_finish 40
7271 pr_start count 4 type 1 event 121 flags a1b
7271 pr_start cb jid 2 id 3
7271 pr_start 121 done 0
7333 recovery_done jid 2 msg 308 91b
7333 recovery_done nodeid 3 flg 1b
7333 recovery_done start_done 121
7252 pr_start last_stop 24 last_start 121 last_finish 24
7252 pr_start count 4 type 1 event 121 flags 1a1b
7252 pr_start cb jid 2 id 3
7252 pr_start 121 done 0
7314 recovery_done jid 2 msg 308 91b
7314 recovery_done nodeid 3 flg 1b
7314 recovery_done start_done 121
7294 recovery_done jid 2 msg 308 91b
7294 recovery_done nodeid 3 flg 1b
7294 recovery_done start_done 121
7275 recovery_done jid 2 msg 308 91b
7275 recovery_done nodeid 3 flg 1b
7275 recovery_done start_done 121
7256 recovery_done jid 2 msg 308 191b
7256 recovery_done nodeid 3 flg 1b
7256 recovery_done start_done 121
7310 pr_finish flags 81b
7368 pr_finish flags 81b
7348 pr_finish flags 81b
7444 pr_finish flags 181b
7329 pr_finish flags 81b
7425 pr_finish flags 181b
7405 pr_finish flags 181b
7290 pr_finish flags 81b
7252 pr_finish flags 181b
7386 pr_finish flags 81b
7272 pr_finish flags 81b
7251 pr_start last_stop 121 last_start 125 last_finish 121
7251 pr_start count 5 type 2 event 125 flags 1a1b
7251 pr_start 125 done 1
7252 pr_finish flags 181b
7271 pr_start last_stop 121 last_start 127 last_finish 121
7271 pr_start count 5 type 2 event 127 flags a1b
7271 pr_start 127 done 1
7271 pr_finish flags 81b
7291 pr_start last_stop 121 last_start 129 last_finish 121
7291 pr_start count 5 type 2 event 129 flags a1b
7291 pr_start 129 done 1
7291 pr_finish flags 81b
7311 pr_start last_stop 121 last_start 131 last_finish 121
7311 pr_start count 5 type 2 event 131 flags a1b
7311 pr_start 131 done 1
7311 pr_finish flags 81b
7330 pr_start last_stop 121 last_start 133 last_finish 121
7330 pr_start count 5 type 2 event 133 flags a1b
7330 pr_start 133 done 1
7330 pr_finish flags 81b
7349 pr_start last_stop 121 last_start 135 last_finish 121
7349 pr_start count 5 type 2 event 135 flags a1b
7349 pr_start 135 done 1
7349 pr_finish flags 81b
7367 pr_start last_stop 121 last_start 137 last_finish 121
7367 pr_start count 5 type 2 event 137 flags a1b
7367 pr_start 137 done 1
7367 pr_finish flags 81b
7386 pr_start last_stop 121 last_start 139 last_finish 121
7386 pr_start count 5 type 2 event 139 flags a1b
7386 pr_start 139 done 1
7386 pr_finish flags 81b
7406 pr_start last_stop 121 last_start 141 last_finish 121
7406 pr_start count 5 type 2 event 141 flags 1a1b
7406 pr_start 141 done 1
7406 pr_finish flags 181b
7425 pr_start last_stop 121 last_start 143 last_finish 121
7425 pr_start count 5 type 2 event 143 flags 1a1b
7425 pr_start 143 done 1
7425 pr_finish flags 181b
7443 pr_start last_stop 121 last_start 145 last_finish 121
7443 pr_start count 5 type 2 event 145 flags 1a1b
7443 pr_start 145 done 1
7443 pr_finish flags 181b
lock_dlm: Assertion failed on line 357 of file /usr/src/gfs/stable_1.0.2/stable/cluster/gfs-kernel/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 1486517232
lisa_vg5_lv1: error=-22 num=3,990448c lkf=9 flags=84
------------[ cut here ]------------
kernel BUG at /usr/src/gfs/stable_1.0.2/stable/cluster/gfs-kernel/src/dlm/lock.c:357!
invalid opcode: 0000 [#1]
SMP
Modules linked in: lock_dlm dlm cman dm_round_robin dm_multipath sg
ide_floppy ide_cd cdrom qla2xxx siimage piix e1000 gfs lock_harness
dm_mod
CPU: 0
EIP: 0060:[<f8aa5586>] Tainted: GF VLI
EFLAGS: 00010246 (2.6.16-rc5-sara3 #1)
EIP is at do_dlm_unlock+0x91/0xaa [lock_dlm]
eax: 00000004 ebx: dbdff440 ecx: 00014e5f edx: 00000246
esi: ffffffea edi: f8c0b000 ebp: f22bdee0 esp: f22bded4
ds: 007b es: 007b ss: 0068
Process gfs_glockd (pid: 7427, threadinfo=f22bc000 task=f209d030)
Stack: <0>f8aa9d89 f8c0b000 dbdf7120 f22bdeec f8aa5824 dbdff440
f22bdf00 f899a7bc
dbdff440 00000003 dbdf7144 f22bdf24 f8990ca4 f8c0b000 dbdff440
00000003
f89c4f00 dbde1200 dbdf7120 dbdf7120 f22bdf40 f899393a dbdf7120
dbde1200
Call Trace:
[<c0103599>] show_stack_log_lvl+0xad/0xb5
[<c01036db>] show_registers+0x10d/0x176
[<c01038ad>] die+0xf2/0x16d
[<c0103996>] do_trap+0x6e/0x8a
[<c0103bed>] do_invalid_op+0x90/0x97
[<c010322f>] error_code+0x4f/0x54
[<f8aa5824>] lm_dlm_unlock+0x1d/0x24 [lock_dlm]
[<f899a7bc>] gfs_lm_unlock+0x2c/0x46 [gfs]
[<f8990ca4>] gfs_glock_drop_th+0xf0/0x12d [gfs]
[<f899393a>] rgrp_go_drop_th+0x1d/0x24 [gfs]
[<f89901f9>] rq_demote+0x79/0x95 [gfs]
[<f89902b4>] run_queue+0x56/0xbb [gfs]
[<f89903d6>] unlock_on_glock+0x1f/0x29 [gfs]
[<f899232a>] gfs_reclaim_glock+0xbf/0x138 [gfs]
[<f8986682>] gfs_glockd+0x3b/0xe3 [gfs]
[<c0100ed9>] kernel_thread_helper+0x5/0xb
Code: 73 34 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 8b 03 ff 70 18 68
a0 a6 aa f8 e8 80 19 67 c7 83 c4 34 68 89 9d aa f8 e8 73 19 67 c7 <0f>
0b 65 01 c0 a4 aa f8 68 a0 a5 aa f8 e8 27 12 67 c7 8d 65 f8
<3>fh_update: test2/CHGCAR already up-to-date!
fh_update: test2/CHGCAR already up-to-date!
fh_update: test2/WAVECAR already up-to-date!
fh_update: test2/WAVECAR already up-to-date!
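For what it's worth, the error=-22 above is -EINVAL, which matches the flood of "send einval" messages before the assertion fired. A quick grep like this helps line up the lock_dlm assertions and the CMAN membership churn across nodes (the log paths are an assumption; adjust for your syslog setup):

```shell
# Pull lock_dlm assertion failures and CMAN membership changes out of
# the kernel logs so the sequence of events can be compared per node.
grep -hE 'lock_dlm: (Assertion|assertion)|CMAN: (removing node|node .* has been removed)' \
    /var/log/kern.log /var/log/messages 2>/dev/null | sort | uniq -c | sort -rn
```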
--
Ivan Pantovic, System Engineer
-----
YUnet International http://www.eunet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 311 9901; Fax: +381 11 311 9901; Mob: +381 63 302 288
-----
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster