Hi everyone,

I noticed today that one of my bricks crashed a week ago. I only noticed because there was very low load on that particular node.

My setup: GlusterFS 3.3.0-1 on Ubuntu 12.04.

gluster> volume info

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 211c824d-b71f-4ce2-b56a-98d0ef68cd1a
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: unic-prd-os-compute1:/data/brick0
Brick2: unic-prd-os-compute2:/data/brick0
Brick3: unic-prd-os-compute3:/data/brick0
Brick4: unic-prd-os-compute4:/data/brick0
Options Reconfigured:
performance.cache-size: 256MB

The log shows some strange timeout errors:

[2012-07-24 01:19:05.566729] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-vol0-client-3: server 127.0.0.1:24009 has not responded in the last 42 seconds, disconnecting.
[2012-07-24 01:19:05.628084] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f600e8d35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f600e8d3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f600e8d314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2012-07-24 01:18:19.905575 (xid=0x44083326x)
[2012-07-24 01:19:05.628132] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-vol0-client-3: remote operation failed: Transport endpoint is not connected
[2012-07-24 01:19:05.628214] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f600e8d35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f600e8d3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f600e8d314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2012-07-24 01:18:21.717671 (xid=0x44083327x)
[2012-07-24 01:19:05.628226] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-vol0-client-3: remote operation failed: Transport endpoint is not connected
[2012-07-24 01:19:05.628261] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xd0) [0x7f600e8d35b0] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f600e8d3220] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f600e8d314e]))) 0-vol0-client-3: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-07-24 01:18:23.564658 (xid=0x44083328x)
[2012-07-24 01:19:05.628280] W [client-handshake.c:275:client_ping_cbk] 0-vol0-client-3: timer must have expired
[2012-07-24 01:19:05.628289] I [client.c:2090:client_rpc_notify] 0-vol0-client-3: disconnected
[2012-07-24 01:19:05.628976] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (8d09515c-ca0b-4048-8ebc-604f7c0d0469) remote_fd is -1. EBADFD
[2012-07-24 01:19:05.629054] W [client3_1-fops.c:5267:client3_1_finodelk] 0-vol0-client-3: (770fdb61-5737-4319-9233-954b3a10dec9) remote_fd is -1. EBADFD
[2012-07-24 01:49:26.750856] E [rpc-clnt.c:208:call_bail] 0-vol0-client-3: bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0x44083329x sent = 2012-07-24 01:19:16.587767. timeout = 1800
[2012-07-24 01:49:26.750907] W [client-handshake.c:1819:client_dump_version_cbk] 0-vol0-client-3: received RPC status error

The gluster console shows that the node is online:

gluster> volume status all
Status of volume: vol0
Gluster process                                   Port    Online  Pid
------------------------------------------------------------------------------
Brick unic-prd-os-compute1:/data/brick0           24009   Y       2166
Brick unic-prd-os-compute2:/data/brick0           24009   Y       16270
Brick unic-prd-os-compute3:/data/brick0           24009   Y       23231
Brick unic-prd-os-compute4:/data/brick0           24009   Y       10519

The process is still there:

root@unic-prd-os-compute4:~# ps aux | grep 10519
root 10519 34.9 0.0 1433352 22916 ? Rsl Jul07 12063:18 /usr/sbin/glusterfsd -s localhost --volfile-id vol0.unic-prd-os-compute4.data-brick0 -p /var/lib/glusterd/vols/vol0/run/unic-prd-os-compute4-data-brick0.pid -S /tmp/cbafb6c90608cd50a23f2a8c8a4c5da5.socket --brick-name /data/brick0 -l /var/log/glusterfs/bricks/data-brick0.log --xlator-option *-posix.glusterd-uuid=8418fd20-2e16-4033-9341-1f2456ca511d --brick-port 24009 --xlator-option vol0-server.listen-port=24009

But it's not connected:

Brick unic-prd-os-compute4:/data/brick0
Number of entries: 0
Status: Brick is Not connected

My questions are:

a) How can I detect such outages (with Nagios, for example)? As I said, I only noticed it because the load dropped (graph in Zabbix).

b) Can I just restart glusterd on that node to trigger self-healing?

Cheers,
Christian
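Regarding question (a), one approach is a small Nagios-style check that parses `gluster volume status` output: in the 3.3 table shown above, the Online column is the second-to-last field on each `Brick` line. This is only a sketch, not a tested plugin; the `check_bricks` helper name is made up here, and the sample input below deliberately flips one brick to `N` for illustration. In a real check you would pipe in live output, e.g. `gluster volume status vol0 | check_bricks`.

```shell
# check_bricks: reads `gluster volume status` text on stdin and prints the
# name of every brick whose Online column (second-to-last field) is not "Y".
check_bricks() {
    awk '/^Brick / && $(NF-1) != "Y" { print $2 }'
}

# Sample status table (one brick flipped offline for demonstration):
sample='Brick unic-prd-os-compute1:/data/brick0    24009  Y  2166
Brick unic-prd-os-compute4:/data/brick0    24009  N  10519'

offline=$(printf '%s\n' "$sample" | check_bricks)
if [ -n "$offline" ]; then
    echo "CRITICAL: bricks offline: $offline"   # Nagios exit code 2 in a real plugin
else
    echo "OK: all bricks online"                # Nagios exit code 0
fi
```

Note that this only catches bricks the console already reports as offline; for the stuck-but-"online" case in this post, you would additionally want to alert on `heal info` reporting "Brick is Not connected".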
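Regarding question (b): since the glusterfsd brick process is still running (just unresponsive), restarting glusterd alone may not replace it. A commonly used sequence on 3.3 looks roughly like the following; it assumes Ubuntu's `service` wrapper and the self-heal commands introduced in 3.3, so verify against your own setup before running anything.

```shell
# If the stuck glusterfsd (pid 10519 in the post) survives a glusterd
# restart, kill it first so glusterd can spawn a fresh brick process:
#   kill 10519

# Restart the management daemon on the affected node:
service glusterd restart

# Trigger a full self-heal on the replicated volume and watch progress:
gluster volume heal vol0 full
gluster volume heal vol0 info
```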