Hi everyone,

I noticed today that one of my bricks crashed a week ago. I only noticed because there was very low load on that particular node.

My setup: GlusterFS 3.3.0-1 on Ubuntu 12.04.

gluster> volume info

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 211c824d-b71f-4ce2-b56a-98d0ef68cd1a
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: unic-prd-os-compute1:/data/brick0
Brick2: unic-prd-os-compute2:/data/brick0
Brick3: unic-prd-os-compute3:/data/brick0
Brick4: unic-prd-os-compute4:/data/brick0
Options Reconfigured:
performance.cache-size: 256MB

The log shows some strange timeout errors:

[2012-07-24 01:19:05.566729] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-vol0-client-3: server 127.0.0.1:24009 has not responded in the last 42 seconds, disconnecting.
[2012-07-24 01:19:05.628084] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f600e8d35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f600e8d3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f600e8d314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2012-07-24 01:18:19.905575 (xid=0x44083326x)
[2012-07-24 01:19:05.628132] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-vol0-client-3: remote operation failed: Transport endpoint is not connected
[2012-07-24 01:19:05.628214] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f600e8d35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f600e8d3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f600e8d314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2012-07-24 01:18:21.717671 (xid=0x44083327x)
[2012-07-24 01:19:05.628226] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-vol0-client-3: remote operation failed: Transport endpoint is not connected
[2012-07-24 01:19:05.628261] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f600e8d35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f600e8d3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f600e8d314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-07-24 01:18:23.564658 (xid=0x44083328x)
[2012-07-24 01:19:05.628280] W [client-handshake.c:275:client_ping_cbk] 0-vol0-client-3: timer must have expired
[2012-07-24 01:19:05.628289] I [client.c:2090:client_rpc_notify] 0-vol0-client-3: disconnected
[2012-07-24 01:19:05.628976] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (8d09515c-ca0b-4048-8ebc-604f7c0d0469) remote_fd is -1. EBADFD
[2012-07-24 01:19:05.629054] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (770fdb61-5737-4319-9233-954b3a10dec9) remote_fd is -1. EBADFD
[2012-07-24 01:49:26.750856] E [rpc-clnt.c:208:call_bail] 0-vol0-client-3: bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0x44083329x sent = 2012-07-24 01:19:16.587767. timeout = 1800
[2012-07-24 01:49:26.750907] W [client-handshake.c:1819:client_dump_version_cbk] 0-vol0-client-3: received RPC status error

The gluster console shows that the node is online:

gluster> volume status all
Status of volume: vol0
Gluster process                                   Port    Online  Pid
------------------------------------------------------------------------------
Brick unic-prd-os-compute1:/data/brick0           24009   Y       2166
Brick unic-prd-os-compute2:/data/brick0           24009   Y       16270
Brick unic-prd-os-compute3:/data/brick0           24009   Y       23231
Brick unic-prd-os-compute4:/data/brick0           24009   Y       10519

The process is still there:

root@unic-prd-os-compute4:~# ps aux | grep 10519
root 10519 34.9 0.0 1433352 22916 ? Rsl Jul07 12063:18 /usr/sbin/glusterfsd -s localhost --volfile-id vol0.unic-prd-os-compute4.data-brick0 -p /var/lib/glusterd/vols/vol0/run/unic-prd-os-compute4-data-brick0.pid -S /tmp/cbafb6c90608cd50a23f2a8c8a4c5da5.socket --brick-name /data/brick0 -l /var/log/glusterfs/bricks/data-brick0.log --xlator-option *-posix.glusterd-uuid=8418fd20-2e16-4033-9341-1f2456ca511d --brick-port 24009 --xlator-option vol0-server.listen-port=24009

But it's not connected:

Brick unic-prd-os-compute4:/data/brick0
Number of entries: 0
Status: Brick is Not connected

My questions are:

a) How can I detect such outages (with Nagios, for example)? As I said, I only noticed it because the load dropped (graph in Zabbix).

b) Can I just restart glusterd on that node to trigger self-healing?

Cheers,
Christian
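Regarding question (a), one approach is a small Nagios-style check that parses `gluster volume status` output: in the 3.3 table shown above, the Online column is the second-to-last field on each `Brick` line. This is only a sketch, not a tested plugin; the `check_bricks` helper name is made up here, and the sample input below deliberately flips one brick to `N` for illustration. In a real check you would pipe in live output, e.g. `gluster volume status vol0 | check_bricks`.

```shell
# check_bricks: reads `gluster volume status` text on stdin and prints the
# name of every brick whose Online column (second-to-last field) is not "Y".
check_bricks() {
    awk '/^Brick / && $(NF-1) != "Y" { print $2 }'
}

# Sample status table (one brick flipped offline for demonstration):
sample='Brick unic-prd-os-compute1:/data/brick0    24009  Y  2166
Brick unic-prd-os-compute4:/data/brick0    24009  N  10519'

offline=$(printf '%s\n' "$sample" | check_bricks)
if [ -n "$offline" ]; then
    echo "CRITICAL: bricks offline: $offline"   # Nagios exit code 2 in a real plugin
else
    echo "OK: all bricks online"                # Nagios exit code 0
fi
```

Note that this only catches bricks the console already reports as offline; for the stuck-but-"online" case in this post, you would additionally want to alert on `heal info` reporting "Brick is Not connected".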
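Regarding question (b): since the glusterfsd brick process is still running (just unresponsive), restarting glusterd alone may not replace it. A commonly used sequence on 3.3 looks roughly like the following; it assumes Ubuntu's `service` wrapper and the self-heal commands introduced in 3.3, so verify against your own setup before running anything.

```shell
# If the stuck glusterfsd (pid 10519 in the post) survives a glusterd
# restart, kill it first so glusterd can spawn a fresh brick process:
#   kill 10519

# Restart the management daemon on the affected node:
service glusterd restart

# Trigger a full self-heal on the replicated volume and watch progress:
gluster volume heal vol0 full
gluster volume heal vol0 info
```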