Error messages on pserver12 (opt*.log):

[2011-05-13 11:41:58.812937] E [client-handshake.c:116:rpc_client_ping_timer_expired] 0-storage0-client-0: Server 10.6.0.108:24009 has not responded in the last 5 seconds, disconnecting.
[2011-05-13 12:11:57.954369] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x210x sent = 2011-05-13 11:41:53.422855. timeout = 1800
[2011-05-13 12:11:57.954415] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 0x209x sent = 2011-05-13 11:41:53.422846. timeout = 1800

Errors on pserver8 (the peer):

[2011-05-13 14:51:26.727334] E [rdma.c:3423:rdma_handle_failed_send_completion] 0-rpc-transport/rdma: send work request on `mlx4_0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x43fa000, wc.byte_len = 0, post->reused = 89791
[2011-05-13 14:51:26.727374] E [rdma.c:3431:rdma_handle_failed_send_completion] 0-rdma: connection between client and server not working. check by running 'ibv_srq_pingpong'. also make sure subnet manager is running (eg: 'opensm'), or check if rdma port is valid (or active) by running 'ibv_devinfo'. contact Gluster Support Team if the problem persists.
[2011-05-13 14:51:26.727617] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x77) [0x7f397dd0ba07] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x7f397dd0b19e] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f397dd0b0fe]))) 0-rpc-clnt: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2011-05-13 14:51:22.620059
[2011-05-13 14:51:26.727670] M [client-handshake.c:1178:client_dump_version_cbk] 0-: some error, retry again later
[2011-05-13 14:51:26.727686] I [client.c:1601:client_rpc_notify] 0-storage0-client-1: disconnected

Could this be a bad IB card?
After a reboot of pserver12 the system works again. An attempt to shut down and restart just the ib0 interface failed (it hung).

Best, Martin

-----Original Message-----
From: Martin Schenker [mailto:martin.schenker at profitbricks.com]
Sent: Friday, May 13, 2011 3:36 PM
To: 'gluster-users at gluster.org'
Subject: How to debug a hanging client?

Hi all!

We have one server/client where the client part hangs quite often. Strace shows:

0 root@de-blnstage-c2-pserver12:~ # strace -Tfv -p 12407 (
Process 12407 attached with 6 threads - interrupt to quit
[pid 12417] futex(0x2cb98a8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 12412] read(12, <unfinished ...>
[pid 12411] read(11, <unfinished ...>
[pid 12410] futex(0x2cb9330, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 12408] rt_sigtimedwait([HUP INT TRAP BUS USR1 USR2 PIPE ALRM TERM CHLD TTOU], NULL, NULL, 8

I can read from the server mountpoint just fine, but any access to the fuse-mounted glusterfs hangs and can only be killed.

Any idea how to resolve this?

If I try to kill all glusterfs processes, the kill -9 on the process

root 12407 1 0 May11 ? 00:00:01 /usr/sbin/glusterfs --log-level=NORMAL --volfile-id=storage0 --volfile-server=localhost /opt/profitbricks/storage

will hang as well, just like an NFS server hang... waiting for I/O.

Thanks, Martin
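
For anyone sifting through logs like the ones quoted in this thread, here is a minimal Python sketch that splits each entry into timestamp, level, source location and message, counts the E-level entries per function, and works out how long a bailed-out frame had been pending. The line pattern is only inferred from the excerpts above, not taken from any GlusterFS documentation, and the helper names (parse_line, errors_by_function, bail_delay_seconds) are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, inferred from the excerpts in this thread:
#   [timestamp] LEVEL [file:line:function] message
LOG_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<msg>.*)$"
)
# call_bail entries carry the time the frame was originally sent.
SENT_RE = re.compile(r"sent = (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")
TS_FMT = "%Y-%m-%d %H:%M:%S.%f"

# Two entries copied from the pserver12 log quoted in this thread.
SAMPLE_LINES = [
    "[2011-05-13 11:41:58.812937] E [client-handshake.c:116:rpc_client_ping_timer_expired] "
    "0-storage0-client-0: Server 10.6.0.108:24009 has not responded in the last 5 seconds, disconnecting.",
    "[2011-05-13 12:11:57.954369] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: "
    "bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x210x "
    "sent = 2011-05-13 11:41:53.422855. timeout = 1800",
]


def parse_line(line):
    """Return a dict for one log entry, or None if the line does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], TS_FMT)
    rec["line"] = int(rec["line"])
    return rec


def errors_by_function(lines):
    """Count E-level entries per reporting function."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec["level"] == "E":
            counts[rec["func"]] += 1
    return counts


def bail_delay_seconds(rec):
    """Seconds between 'sent = ...' in the message and the log timestamp."""
    m = SENT_RE.search(rec["msg"])
    if m is None:
        return None
    sent = datetime.strptime(m.group(1), TS_FMT)
    return (rec["ts"] - sent).total_seconds()


if __name__ == "__main__":
    print(errors_by_function(SAMPLE_LINES))
    for line in SAMPLE_LINES:
        rec = parse_line(line)
        print(rec["func"], bail_delay_seconds(rec))
```

Run against the PING entry above, the delay comes out at about 1804.5 seconds, i.e. the frame was bailed out a few seconds after the 1800-second timeout printed in the same line had elapsed.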