Dear colleagues,

Your advice will be greatly appreciated.

I have another small GFS2 cluster: 2 nodes connected to the same iSCSI target. Tonight something happened, and now neither node can work with the mounted filesystem anymore. Processes that already had files open on the filesystem still hold them open and keep working with them, but I can't open new files; I can't even list the files on the mountpoint with "ls".

Both nodes are joined:

Node  Sts   Inc   Joined                Name
   1   M    388   2013-11-26 03:43:01   ***
   2   M    360   2013-11-11 07:39:22   ***

Here is what "gfs_control dump" says:

1384148367 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
1384148367 gfs_controld 3.0.12.1 started
1384148367 cluster node 1 added seq 364
1384148367 cluster node 2 added seq 364
1384148367 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
1384148367 group_mode 3 compat 0
1384148367 setup_cpg_daemon 14
1384148367 gfs:controld conf 2 1 0 memb 1 2 join 2 left
1384148367 run protocol from nodeid 1
1384148367 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1384148372 client connection 5 fd 16
1384148372 join: /mnt/psv4 gfs2 lock_dlm ckvm1_pod1:psv4 rw,noatime,nodiratime /dev/dm-0
1384148372 psv4 join: cluster name matches: ckvm1_pod1
1384148372 psv4 process_dlmcontrol register 0
1384148372 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 2 left
1384148372 psv4 add_change cg 1 joined nodeid 2
1384148372 psv4 add_change cg 1 we joined
1384148372 psv4 add_change cg 1 counts member 2 joined 1 remove 0 failed 0
1384148372 psv4 wait_conditions skip for zero started_count
1384148372 psv4 send_start cg 1 id_count 2 om 0 nm 2 oj 0 nj 0
1384148372 psv4 receive_start 2:1 len 104
1384148372 psv4 match_change 2:1 matches cg 1
1384148372 psv4 wait_messages cg 1 need 1 of 2
1384148372 psv4 receive_start 1:2 len 104
1384148372 psv4 match_change 1:2 matches cg 1
1384148372 psv4 wait_messages cg 1 got all 2
1384148372 psv4 pick_first_recovery_master old 1
1384148372 psv4 sync_state first_recovery_needed master 1
1384148372 psv4 create_old_nodes 1 jid 0 ro 0 spect 0 kernel_mount_done 0 error 0
1384148372 psv4 create_new_nodes 2 ro 0 spect 0
1384148372 psv4 create_new_journals 2 gets jid 1
1384148373 psv4 receive_first_recovery_done from 1 master 1 mount_client_notified 0
1384148373 psv4 start_kernel cg 1 member_count 2
1384148373 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 0
1384148373 psv4 set open /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block error -1 2
1384148373 psv4 client_reply_join_full ci 5 result 0 hostdata=jid=1:id=2447518500:first=0
1384148373 client_reply_join psv4 ci 5 result 0
1384148373 psv4 wait_recoveries done
1384148373 uevent add gfs2 /fs/gfs2/ckvm1_pod1:psv4
1384148373 psv4 ping_kernel_mount 0
1384148373 psv4 receive_mount_done from 1 result 0
1384148373 psv4 wait_recoveries done
1384148373 uevent change gfs2 /fs/gfs2/ckvm1_pod1:psv4
1384148373 psv4 recovery_uevent jid 1 ignore
1384148373 uevent online gfs2 /fs/gfs2/ckvm1_pod1:psv4
1384148373 psv4 ping_kernel_mount 0
1384148373 mount_done: psv4 result 0
1384148373 psv4 receive_mount_done from 2 result 0
1384148373 psv4 wait_recoveries done
1385430013 cluster node 1 removed seq 368
1385430013 gfs:controld conf 1 0 1 memb 2 join left 1
1385430013 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
1385430013 psv4 add_change cg 2 remove nodeid 1 reason 3
1385430013 psv4 add_change cg 2 counts member 1 joined 0 remove 1 failed 1
1385430013 psv4 stop_kernel
1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 1
1385430013 psv4 check_dlm_notify nodeid 1 begin
1385430013 psv4 process_dlmcontrol notified nodeid 1 result -11
1385430013 psv4 check_dlm_notify result -11 will retry nodeid 1
1385430013 psv4 check_dlm_notify nodeid 1 begin
1385430013 psv4 process_dlmcontrol notified nodeid 1 result 0
1385430013 psv4 check_dlm_notify done
1385430013 psv4 send_start cg 2 id_count 2 om 1 nm 0 oj 0 nj 1
1385430013 psv4 receive_start 2:2 len 104
1385430013 psv4 match_change 2:2 matches cg 2
1385430013 psv4 wait_messages cg 2 got all 1
1385430013 psv4 sync_state first_recovery_msg
1385430013 psv4 set_failed_journals jid 0 nodeid 1
1385430013 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
1385430013 psv4 start_journal_recovery jid 0
1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/recover to 0
1385430044 cluster node 1 added seq 372
1385430044 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
1385430044 psv4 add_change cg 3 joined nodeid 1
1385430044 psv4 add_change cg 3 counts member 2 joined 1 remove 0 failed 0
1385430044 psv4 check_dlm_notify done
1385430044 psv4 send_start cg 3 id_count 3 om 1 nm 1 oj 1 nj 0
1385430044 cpg_mcast_joined retried 1 start
1385430044 gfs:controld conf 2 1 0 memb 1 2 join 1 left
1385430044 psv4 receive_start 2:3 len 116
1385430044 psv4 match_change 2:3 matches cg 3
1385430044 psv4 wait_messages cg 3 need 1 of 2
1385430044 psv4 receive_start 1:4 len 116
1385430044 psv4 match_change 1:4 matches cg 3
1385430044 receive_start 1:4 add node with started_count 3
1385430044 psv4 wait_messages cg 3 need 1 of 2
1385430088 cluster node 1 removed seq 376
1385430088 gfs:controld conf 1 0 1 memb 2 join left 1
1385430088 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
1385430088 psv4 add_change cg 4 remove nodeid 1 reason 3
1385430088 psv4 add_change cg 4 counts member 1 joined 0 remove 1 failed 1
1385430088 psv4 check_dlm_notify nodeid 1 begin
1385430088 psv4 process_dlmcontrol notified nodeid 1 result 0
1385430088 psv4 check_dlm_notify done
1385430088 psv4 send_start cg 4 id_count 2 om 1 nm 0 oj 1 nj 0
1385430088 psv4 receive_start 2:4 len 104
1385430088 psv4 match_change 2:4 skip 3 already start
1385430088 psv4 match_change 2:4 matches cg 4
1385430088 psv4 wait_messages cg 4 got all 1
1385430088 psv4 sync_state first_recovery_msg
1385430088 psv4 set_failed_journals no journal for nodeid 1
1385430088 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
1385430092 cluster node 1 added seq 380
1385430092 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
1385430092 psv4 add_change cg 5 joined nodeid 1
1385430092 psv4 add_change cg 5 counts member 2 joined 1 remove 0 failed 0
1385430092 psv4 check_dlm_notify done
1385430092 psv4 send_start cg 5 id_count 3 om 1 nm 1 oj 1 nj 0
1385430092 cpg_mcast_joined retried 1 start
1385430092 gfs:controld conf 2 1 0 memb 1 2 join 1 left
1385430092 psv4 receive_start 2:5 len 116
1385430092 psv4 match_change 2:5 matches cg 5
1385430092 psv4 wait_messages cg 5 need 1 of 2
1385430092 psv4 receive_start 1:6 len 116
1385430092 psv4 match_change 1:6 matches cg 5
1385430092 receive_start 1:6 add node with started_count 4
1385430092 psv4 wait_messages cg 5 need 1 of 2
1385430143 cluster node 1 removed seq 384
1385430143 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
1385430143 psv4 add_change cg 6 remove nodeid 1 reason 3
1385430143 psv4 add_change cg 6 counts member 1 joined 0 remove 1 failed 1
1385430143 psv4 check_dlm_notify nodeid 1 begin
1385430143 gfs:controld conf 1 0 1 memb 2 join left 1
1385430143 psv4 process_dlmcontrol notified nodeid 1 result 0
1385430143 psv4 check_dlm_notify done
1385430143 psv4 send_start cg 6 id_count 2 om 1 nm 0 oj 1 nj 0
1385430143 psv4 receive_start 2:6 len 104
1385430143 psv4 match_change 2:6 skip 5 already start
1385430143 psv4 match_change 2:6 matches cg 6
1385430143 psv4 wait_messages cg 6 got all 1
1385430143 psv4 sync_state first_recovery_msg
1385430143 psv4 set_failed_journals no journal for nodeid 1
1385430143 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
1385430181 cluster node 1 added seq 388
1385430181 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
1385430181 psv4 add_change cg 7 joined nodeid 1
1385430181 psv4 add_change cg 7 counts member 2 joined 1 remove 0 failed 0
1385430181 psv4 check_dlm_notify done
1385430181 psv4 send_start cg 7 id_count 3 om 1 nm 1 oj 1 nj 0
1385430181 cpg_mcast_joined retried 1 start
1385430181 gfs:controld conf 2 1 0 memb 1 2 join 1 left
1385430181 psv4 receive_start 2:7 len 116
1385430181 psv4 match_change 2:7 matches cg 7
1385430181 psv4 wait_messages cg 7 need 1 of 2
1385430181 psv4 receive_start 1:8 len 116
1385430181 psv4 match_change 1:8 matches cg 7
1385430181 receive_start 1:8 add node with started_count 5
1385430181 psv4 wait_messages cg 7 need 1 of 2
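
From the dump it looks to me like recovery of node 1's journal (jid 0) was started but never finished: the filesystem was blocked ("set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 1"), "start_journal_recovery jid 0" ran once, and every later change group still ends with "wait_recoveries jid 0 nodeid 1 unrecovered". To look at the current state I was going to check something like this (gfs_control and dlm_tool are from our cluster 3.0.12 packages; I'm writing the commands from memory, so please correct me if the options are wrong):

  # cman_tool status                                     # quorum / membership
  # gfs_control ls -n                                    # mountgroup members and journal state
  # dlm_tool ls                                          # dlm lockspaces (psv4 should be listed)
  # cat /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block   # the dump sets this to 1 and I never see it reset
  # dlm_tool lockdebug psv4 | head                       # to see what the dlm is waiting on

Is that the right way to look at it, or is there something better to collect before touching anything?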
I can't reboot the nodes, they're pretty busy, but, of course, I'd like to get that GFS2 filesystem working again.

Here is what I got in the log file when it happened:

Nov 26 03:40:11 host2 corosync[2596]: [TOTEM ] A processor failed, forming new configuration.
Nov 26 03:40:12 host2 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5576348348, last ping 5576353348, now 5576358348
Nov 26 03:40:12 host2 kernel: connection1:0: detected conn error (1011)
Nov 26 03:40:13 host2 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 26 03:40:13 host2 corosync[2596]: [CMAN ] quorum lost, blocking activity
Nov 26 03:40:13 host2 corosync[2596]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 26 03:40:13 host2 corosync[2596]: [QUORUM] Members[1]: 2
Nov 26 03:40:13 host2 corosync[2596]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 03:40:13 host2 corosync[2596]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.2) ; members(old:2 left:1)
Nov 26 03:40:13 host2 corosync[2596]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 03:40:13 host2 kernel: dlm: closing connection to node 1
Nov 26 03:40:13 host2 kernel: GFS2: fsid=ckvm1_pod1:psv4.1: jid=0: Trying to acquire journal lock...
Nov 26 03:40:44 host2 iscsid: connection1:0 is operational after recovery (3 attempts)
Nov 26 03:40:44 host2 corosync[2596]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 03:40:44 host2 corosync[2596]: [CMAN ] quorum regained, resuming activity
Nov 26 03:40:44 host2 corosync[2596]: [QUORUM] This node is within the primary component and will provide service.
Nov 26 03:40:44 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:40:44 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:40:44 host2 corosync[2596]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.1) ; members(old:1 left:0)
Nov 26 03:40:44 host2 corosync[2596]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 03:40:44 host2 gfs_controld[2727]: receive_start 1:4 add node with started_count 3
Nov 26 03:40:44 host2 fenced[2652]: receive_start 1:4 add node with started_count 2
Nov 26 03:41:26 host2 corosync[2596]: [TOTEM ] A processor failed, forming new configuration.
Nov 26 03:41:28 host2 corosync[2596]: [CMAN ] quorum lost, blocking activity
Nov 26 03:41:28 host2 corosync[2596]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 26 03:41:28 host2 corosync[2596]: [QUORUM] Members[1]: 2
Nov 26 03:41:28 host2 corosync[2596]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 03:41:28 host2 corosync[2596]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.2) ; members(old:2 left:1)
Nov 26 03:41:28 host2 corosync[2596]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 03:41:28 host2 kernel: dlm: closing connection to node 1
Nov 26 03:41:29 host2 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5576425428, last ping 5576430428, now 5576435428
Nov 26 03:41:29 host2 kernel: connection1:0: detected conn error (1011)
Nov 26 03:41:30 host2 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 26 03:41:32 host2 corosync[2596]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 03:41:32 host2 corosync[2596]: [CMAN ] quorum regained, resuming activity
Nov 26 03:41:32 host2 corosync[2596]: [QUORUM] This node is within the primary component and will provide service.
Nov 26 03:41:32 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:41:32 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:41:32 host2 corosync[2596]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.1) ; members(old:1 left:0)
Nov 26 03:41:32 host2 corosync[2596]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 03:41:32 host2 fenced[2652]: receive_start 1:6 add node with started_count 2
Nov 26 03:41:32 host2 gfs_controld[2727]: receive_start 1:6 add node with started_count 4
Nov 26 03:41:37 host2 iscsid: connection1:0 is operational after recovery (1 attempts)
Nov 26 03:42:19 host2 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 5576475399, last ping 5576480399, now 5576485399
Nov 26 03:42:19 host2 kernel: connection1:0: detected conn error (1011)
Nov 26 03:42:20 host2 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 26 03:42:21 host2 corosync[2596]: [TOTEM ] A processor failed, forming new configuration.
Nov 26 03:42:23 host2 corosync[2596]: [CMAN ] quorum lost, blocking activity
Nov 26 03:42:23 host2 corosync[2596]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 26 03:42:23 host2 corosync[2596]: [QUORUM] Members[1]: 2
Nov 26 03:42:23 host2 corosync[2596]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 03:42:23 host2 corosync[2596]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.2) ; members(old:2 left:1)
Nov 26 03:42:23 host2 corosync[2596]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 03:42:23 host2 kernel: dlm: closing connection to node 1
Nov 26 03:42:41 host2 kernel: INFO: task kslowd001:2942 blocked for more than 120 seconds.
Nov 26 03:42:41 host2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:42:41 host2 kernel: kslowd001 D 000000000000000b 0 2942 2 0x00000080
Nov 26 03:42:41 host2 kernel: ffff88086b29d958 0000000000000046 0000000000000102 0000005000000002
Nov 26 03:42:41 host2 kernel: fffffffffffffffc 000000000000010e 0000003f00000002 fffffffffffffffc
Nov 26 03:42:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8 000000000000fb88 ffff88086b29bab8
Nov 26 03:42:41 host2 kernel: Call Trace:
Nov 26 03:42:41 host2 kernel: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0
Nov 26 03:42:41 host2 kernel: [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30
Nov 26 03:42:41 host2 kernel: [<ffffffff8127e634>] call_rwsem_down_read_failed+0x14/0x30
Nov 26 03:42:41 host2 kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30
Nov 26 03:42:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0 [dlm]
Nov 26 03:42:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf+0x484/0x5f0
Nov 26 03:42:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock+0xf1/0x130 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast+0x0/0x50 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a385>] do_xmote+0x1a5/0x280 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf+0x34/0x40
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a551>] run_queue+0xf1/0x1d0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq+0x21e/0x3d0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac71>] gfs2_glock_nq_num+0x61/0xa0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa064eca3>] gfs2_recover_work+0x93/0x7b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Nov 26 03:42:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac69>] ? gfs2_glock_nq_num+0x59/0xa0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8106335b>] ? enqueue_task_fair+0xfb/0x100
Nov 26 03:42:41 host2 kernel: [<ffffffff81108093>] slow_work_execute+0x233/0x310
Nov 26 03:42:41 host2 kernel: [<ffffffff811082c7>] slow_work_thread+0x157/0x360
Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Nov 26 03:42:41 host2 kernel: [<ffffffff81108170>] ? slow_work_thread+0x0/0x360
Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Nov 26 03:42:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for more than 120 seconds.
Nov 26 03:42:41 host2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:42:41 host2 kernel: gfs2_quotad D 0000000000000001 0 2950 2 0x00000080
Nov 26 03:42:41 host2 kernel: ffff88086afdfc20 0000000000000046 0000000000000000 ffffffffa0605f4d
Nov 26 03:42:41 host2 kernel: 0000000000000000 ffff88106c505800 ffff88086afdfc50 ffffffffa0604708
Nov 26 03:42:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8 000000000000fb88 ffff88086afddaf8
Nov 26 03:42:41 host2 kernel: Call Trace:
Nov 26 03:42:41 host2 kernel: [<ffffffffa0605f4d>] ? dlm_put_lockspace+0x1d/0x40 [dlm]
Nov 26 03:42:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock+0x98/0x1e0 [dlm]
Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063757e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
Nov 26 03:42:41 host2 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Nov 26 03:42:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
Nov 26 03:42:41 host2 kernel: [<ffffffffa0653658>] gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff814fe75a>] ? schedule_timeout+0x19a/0x2e0
Nov 26 03:42:41 host2 kernel: [<ffffffffa0653650>] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa064b9d7>] quotad_check_timeo+0x57/0xb0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad+0x234/0x2b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Nov 26 03:42:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Nov 26 03:42:54 host2 iscsid: connect to 192.168.1.161:3260 failed (No route to host)
Nov 26 03:43:00 host2 iscsid: connect to 192.168.1.161:3260 failed (No route to host)
Nov 26 03:43:01 host2 corosync[2596]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 03:43:01 host2 corosync[2596]: [CMAN ] quorum regained, resuming activity
Nov 26 03:43:01 host2 corosync[2596]: [QUORUM] This node is within the primary component and will provide service.
Nov 26 03:43:01 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:43:01 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:43:01 host2 corosync[2596]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.1) ; members(old:1 left:0)
Nov 26 03:43:01 host2 corosync[2596]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 03:43:01 host2 gfs_controld[2727]: receive_start 1:8 add node with started_count 5
Nov 26 03:43:01 host2 fenced[2652]: receive_start 1:8 add node with started_count 2
Nov 26 03:43:03 host2 iscsid: connection1:0 is operational after recovery (5 attempts)
Nov 26 03:44:41 host2 kernel: INFO: task kslowd001:2942 blocked for more than 120 seconds.
Nov 26 03:44:41 host2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:44:41 host2 kernel: kslowd001 D 000000000000000b 0 2942 2 0x00000080
Nov 26 03:44:41 host2 kernel: ffff88086b29d958 0000000000000046 0000000000000102 0000005000000002
Nov 26 03:44:41 host2 kernel: fffffffffffffffc 000000000000010e 0000003f00000002 fffffffffffffffc
Nov 26 03:44:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8 000000000000fb88 ffff88086b29bab8
Nov 26 03:44:41 host2 kernel: Call Trace:
Nov 26 03:44:41 host2 kernel: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0
Nov 26 03:44:41 host2 kernel: [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30
Nov 26 03:44:41 host2 kernel: [<ffffffff8127e634>] call_rwsem_down_read_failed+0x14/0x30
Nov 26 03:44:41 host2 kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30
Nov 26 03:44:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0 [dlm]
Nov 26 03:44:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf+0x484/0x5f0
Nov 26 03:44:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock+0xf1/0x130 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast+0x0/0x50 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a385>] do_xmote+0x1a5/0x280 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf+0x34/0x40
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a551>] run_queue+0xf1/0x1d0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq+0x21e/0x3d0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac71>] gfs2_glock_nq_num+0x61/0xa0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa064eca3>] gfs2_recover_work+0x93/0x7b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Nov 26 03:44:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac69>] ? gfs2_glock_nq_num+0x59/0xa0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8106335b>] ? enqueue_task_fair+0xfb/0x100
Nov 26 03:44:41 host2 kernel: [<ffffffff81108093>] slow_work_execute+0x233/0x310
Nov 26 03:44:41 host2 kernel: [<ffffffff811082c7>] slow_work_thread+0x157/0x360
Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Nov 26 03:44:41 host2 kernel: [<ffffffff81108170>] ? slow_work_thread+0x0/0x360
Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Nov 26 03:44:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for more than 120 seconds.
Nov 26 03:44:41 host2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:44:41 host2 kernel: gfs2_quotad D 0000000000000001 0 2950 2 0x00000080
Nov 26 03:44:41 host2 kernel: ffff88086afdfc20 0000000000000046 0000000000000000 ffffffffa0605f4d
Nov 26 03:44:41 host2 kernel: 0000000000000000 ffff88106c505800 ffff88086afdfc50 ffffffffa0604708
Nov 26 03:44:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8 000000000000fb88 ffff88086afddaf8
Nov 26 03:44:41 host2 kernel: Call Trace:
Nov 26 03:44:41 host2 kernel: [<ffffffffa0605f4d>] ? dlm_put_lockspace+0x1d/0x40 [dlm]
Nov 26 03:44:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock+0x98/0x1e0 [dlm]
Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063757e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
Nov 26 03:44:41 host2 kernel: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
Nov 26 03:44:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
Nov 26 03:44:41 host2 kernel: [<ffffffffa0653658>] gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff814fe75a>] ? schedule_timeout+0x19a/0x2e0
Nov 26 03:44:41 host2 kernel: [<ffffffffa0653650>] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa064b9d7>] quotad_check_timeo+0x57/0xb0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad+0x234/0x2b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Nov 26 03:44:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

What would you do in this situation? Is it possible to restart GFS2 without rebooting the nodes?

Thank you very much for any help.

--
V.Melnik
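
P.S. The only idea I have so far is to retry the recovery of journal 0 by hand through the same sysfs files that gfs_controld uses in the dump above, roughly like this (this is only a guess on my side, not something I have tried yet):

  # echo 0 > /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/recover   # ask the kernel to recover jid 0 again
  # dmesg | tail                                                # watch the "GFS2: fsid=ckvm1_pod1:psv4.1: jid=0:" messages
  # echo 0 > /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block     # unblock the filesystem, but only if recovery really completes

I have no idea whether this is safe while dlm_lock is stuck as in the traces above, so I'd be grateful for a confirmation or a warning before I try it.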