I'm helping a colleague collect information on an application lockup problem on a two-node DLM/GFS cluster, with GFS on a shared SCSI array. I'd appreciate advice as to what information to collect next.

Packages in use are:

  kernel-smp-2.6.9-67.EL.i686.rpm
  dlm-1.0.7-1.i686.rpm
  dlm-kernel-smp-2.6.9-52.2.i686.rpm
  GFS-kernel-smp-2.6.9-75.9.i686.rpm
  GFS-6.1.15-1.i386.rpm
  ccs-1.0.11-1.i686.rpm
  cman-1.0.17-0.i686.rpm
  cman-kernel-smp-2.6.9-53.5.i686.rpm

We've reduced the application code to a simple test case. The following code, run on each node, soon blocks, and doesn't receive signals until the peer node is shut down:

    ...
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    while (1) {
        fl.l_type = F_WRLCK;
        retval = fcntl(filedes, F_SETLKW, &fl);
        if (retval == -1) {
            perror("lock");
            exit(1);
        }

        /* attempt to unlock the index file */
        fl.l_type = F_UNLCK;
        retval = fcntl(filedes, F_SETLKW, &fl);
        if (retval == -1) {
            perror("unlock");
            exit(1);
        }
    }
    ...

/proc/cluster/dlm_debug on the respective nodes showed this on the most recent run:

Node 1:

  FS1 send einval to 2
  FS1 send einval to 2
  [above line many times]
  FS1 send einval to 2
  FS1 send einval to 2
  FS1 grant lock on lockqueue 2
  FS1 process_lockqueue_reply id 5400c2 state 0

Node 2:

  FS1 (31613) req reply einval 3de002b1 fr 1 r 1 7
  FS1 (31613) req reply einval 3ea30356 fr 1 r 1 7
  FS1 (31613) req reply einval 3f0100d5 fr 1 r 1 7
  FS1 (31613) req reply einval 3df10367 fr 1 r 1 7
  FS1 (31613) req reply einval 3fa600be fr 1 r 1 7
  FS1 (31613) req reply einval 3f430355 fr 1 r 1 7
  FS1 (31613) req reply einval 3fd20096 fr 1 r 1 7
  FS1 (31613) req reply einval 3fc900d3 fr 1 r 1 7
  FS1 (31613) req reply einval 3fe60375 fr 1 r 1 7
  FS1 (31613) req reply einval 3f870143 fr 1 r 1 7
  FS1 (31613) req reply einval 3f690239 fr 1 r 1 7
  FS1 (31613) req reply einval 3eb40379 fr 1 r 1 7
  FS1 (31613) req reply einval 3fb00352 fr 1 r 1 7
  FS1 (31613) req reply einval 40a002f6 fr 1 r 1 7
  FS1 (31613) req reply einval 3fb90265 fr 1 r 1 7
  FS1 (31613) req reply einval 400b0326 fr 1 r 1 7
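For anyone wanting to reproduce this, here is a self-contained, compilable version of the test loop. It is a sketch, not the original program: the default path, the cycle count, and the added F_GETLK probe are my additions for illustration. The F_GETLK query reports the PID of a conflicting lock holder before blocking, which may help correlate a stuck fcntl() with the lockdump output; note that how l_pid is reported for a lock held by the peer node depends on the lock module.

    /* Sketch of the test case: repeatedly take and release a one-byte
     * write lock. Compile with: gcc -o locktest locktest.c
     * Run on each node against the same file on the GFS mount. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int run_lock_cycles(const char *path, int cycles)
    {
        struct flock fl;
        int i;
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd == -1) { perror("open"); return -1; }

        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 1;

        for (i = 0; i < cycles; i++) {
            /* Probe for a conflicting holder before blocking; if one
             * exists, l_pid identifies it (remote-holder reporting is
             * filesystem/lock-module dependent). */
            struct flock probe = fl;
            probe.l_type = F_WRLCK;
            if (fcntl(fd, F_GETLK, &probe) == 0 && probe.l_type != F_UNLCK)
                fprintf(stderr, "cycle %d: would block on pid %d\n",
                        i, (int)probe.l_pid);

            fl.l_type = F_WRLCK;
            if (fcntl(fd, F_SETLKW, &fl) == -1) {
                perror("lock"); close(fd); return -1;
            }

            fl.l_type = F_UNLCK;
            if (fcntl(fd, F_SETLKW, &fl) == -1) {
                perror("unlock"); close(fd); return -1;
            }
        }
        close(fd);
        return 0;
    }

    int main(int argc, char **argv)
    {
        /* "/tmp/locktest" is a placeholder default; pass a path on the
         * GFS mount to exercise the cluster lock path. */
        const char *path = argc > 1 ? argv[1] : "/tmp/locktest";
        if (run_lock_cycles(path, 5) == -1)
            exit(1);
        printf("5 lock/unlock cycles completed\n");
        return 0;
    }

On a local filesystem this completes immediately; on the GFS mount it should reproduce the hang described above within a few iterations.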
I have lockdump files from each node, but don't know how to interpret them.

On shutdown, the GFS unmount failed, and a kernel panic followed:

  Turning off quotas:  [  OK  ]
  Unmounting file systems:
  umount2: Device or resource busy
  umount: /diskarray: device is busy
  umount2: Device or resource busy
  umount: /diskarray: device is busy
  CMAN: No functional network interfaces, leaving cluster
  CMAN: sendmsg failed: -22
  CMAN: we are leaving the cluster.
  WARNING: dlm_emergency_shutdown
  SM: 00000002 sm_stop: SG still joined
  SM: 01000004 sm_stop: SG still joined
  SM: 02000006 sm_stop: SG still joined

  ds: 007b   es: 007b   ss: 0068
  Process gfs_glockd (pid: 5654, threadinfo=f40d2000 task=f3c4b230)
  Stack: f8ade2d3 f8bb8000 00000003 f2c4ee80 f8ad98b2 f8c28ede 00000001 f33c0c7c
         f33c0c60 f8c1ed63 f8c55da0 d4aa4940 f33c0c60 f8c55da0 f33c0c60 f8c1e257
         f33c0c60 00000001 f33c0cf4 f8c1e30e f33c0c60 f33c0c7c f8c1e431 00000001
  Call Trace:
   [<f8ad98b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
   [<f8c28ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
   [<f8c1ed63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
   [<f8c1e257>] rq_demote+0x7f/0x98 [gfs]
   [<f8c1e30e>] run_queue+0x5a/0xc1 [gfs]
   [<f8c1e431>] unlock_on_glock+0x1f/0x28 [gfs]
   [<f8c203e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
   [<f8c12e05>] gfs_glockd+0x39/0xde [gfs]
   [<c011e7b9>] default_wake_function+0x0/0xc
   [<c02d8522>] ret_from_fork+0x6/0x14
   [<c011e7b9>] default_wake_function+0x0/0xc
   [<f8c12dcc>] gfs_glockd+0x0/0xde [gfs]
   [<c01041f5>] kernel_thread_helper+0x5/0xb
  Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 ef e3 ad f8 e8 de 92 64 c7 83 c4 34 68 d3 e2 ad f8 e8 d1 92 64 c7 <0f> 0b 69 01 1b e2 ad f8 68 d5 e2 ad f8 e8 8c 8a 64 c7 5b 5e 5f
  <0>Fatal exception: panic in 5 seconds
  Kernel panic - not syncing: Fatal exception

---
Charlie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster