I've been testing a 3-node GFS file system on shared fibre channel storage and have run into a couple of strange things. The 3 nodes are cl030, cl031, and cl032.

1. After running tar tests on all 3 nodes for about a day, I wanted to try out the patch that gets rid of the might_sleep() warning. I unmounted the GFS file system on cl031 and then tried to rmmod the lock_dlm module, but couldn't because of the module use count:

# umount /gfs_stripe5
dlm: connecting to 2
dlm: closing connection to node 1
Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():0
 [<c01062ae>] dump_stack+0x1e/0x30
 [<c011ce47>] __might_sleep+0xb7/0xf0
 [<f8ad2a85>] nodeid2con+0x25/0x1e0 [dlm]
 [<f8ad4102>] lowcomms_close+0x42/0x70 [dlm]
 [<f8ad59cc>] put_node+0x2c/0x70 [dlm]
 [<f8ad5b97>] release_csb+0x17/0x30 [dlm]
 [<f8ad60d3>] nodes_clear+0x33/0x40 [dlm]
 [<f8ad60f7>] ls_nodes_clear+0x17/0x30 [dlm]
 [<f8ad25fd>] release_lockspace+0x1fd/0x2f0 [dlm]
 [<f8a9ff5c>] release_gdlm+0x1c/0x30 [lock_dlm]
 [<f8aa0214>] lm_dlm_unmount+0x24/0x50 [lock_dlm]
 [<f881e496>] lm_unmount+0x46/0xac [lock_harness]
 [<f8ba189f>] gfs_put_super+0x30f/0x3c0 [gfs]
 [<c01654fa>] generic_shutdown_super+0x18a/0x1a0
dlm: connecting to 1
 [<c016608d>] kill_block_super+0x1d/0x40
 [<c01652a1>] deactivate_super+0x81/0xa0
 [<c017c6cc>] sys_umount+0x3c/0xa0
dlm: closing connection to node 2
dlm: closing connection to node 3
dlm: got connection from 2
dlm: got connection from 1

# lsmod
Module                  Size  Used by
lock_dlm               39408  2
dlm                   128008  1 lock_dlm
gfs                   296780  0
lock_harness            3868  2 lock_dlm,gfs
qla2200                86432  0
qla2xxx               112064  1 qla2200
cman                  128480  8 lock_dlm,dlm
dm_mod                 53536  0

# rmmod lock_dlm
ERROR: Module lock_dlm is in use

----> At this point, the lock_dlm module would not unload because it still had a use count of 2. The "got connection" messages after the umount also look strange. What do those messages mean?

2. After rebooting cl031, I got it to rejoin the cluster, but when I tried to remount the GFS file system the mount hung:

# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   cl030a
   2    1    3   M   cl032a
   3    1    3   M   cl031a

# mount -t gfs /dev/sdf1 /gfs_stripe5
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
dlm: stripefs: recover event 2 (first)
dlm: stripefs: add nodes
dlm: connecting to 1
==> mount HUNG here

cl031 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 3]

DLM Lock Space:  "stripefs"                         18   3 join      S-6,20,3
[2 1 3]

================

cl032 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   4 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                         18  21 update    U-4,1,3
[1 2 3]

GFS Mount Group: "stripefs"                         19  22 run       -
[1 2]

================

cl030 proc]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                         18  23 update    U-4,1,3
[1 2 3]

GFS Mount Group: "stripefs"                         19  24 run       -
[1 2]

It looks like there is some problem joining the DLM Lock Space. I have stack traces available from all 3 machines if that provides any useful info (http://developer.osdl.org/daniel/gfs_hang/).

I reset cl031 and the other 2 nodes recovered ok:

dlm: stripefs: total nodes 3
dlm: stripefs: nodes_reconfig failed -1
dlm: stripefs: recover event 76 error -1

cl032:
CMAN: no HELLO from cl031a, removing from the cluster
dlm: stripefs: total nodes 3
dlm: stripefs: nodes_reconfig failed 1
dlm: stripefs: recover event 69 error

Anyone seen anything like this?

Daniel
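
P.S. In case it helps anyone reading, my understanding of that Debug trace is that down_read() on the rwsem taken in nodeid2con() can sleep, but the release_lockspace() path reaches it with in_atomic():1 (presumably a spinlock held somewhere up the chain), which is exactly what the kernel's sleep-in-atomic debugging flags. A minimal module that reproduces the same class of warning looks roughly like this (the names are made up for illustration; this is not the dlm code):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/rwsem.h>

static spinlock_t demo_lock;            /* hypothetical locks, illustration only */
static struct rw_semaphore demo_rwsem;

static int __init demo_init(void)
{
        spin_lock_init(&demo_lock);
        init_rwsem(&demo_rwsem);

        spin_lock(&demo_lock);          /* enter atomic context */
        down_read(&demo_rwsem);         /* may sleep -> "sleeping function
                                           called from invalid context",
                                           in_atomic():1 */
        up_read(&demo_rwsem);
        spin_unlock(&demo_lock);
        return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");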