Help, peer probe seems to get stuck on large cluster.

Yiping Peng <barius.cn@xxxxxxxxx> · Mon, 31 Aug 2015 15:40:07 +0800

Hi guys,

I've been running GlusterFS for a couple of days and it's been nice and steady,
except for one minor problem: peer probing on my relatively large cluster seems
to have been stuck for a long time.

Last time, atinm told me on IRC (I was barius.2333 there) that peer probing on a
cluster as large as 50+ nodes might take a long time (O(n^2) time), and now my
cluster has expanded to 90+ nodes.

The peer probing process was started 4 days ago, when my cluster had ~50 nodes.
I probed ~40 nodes at once using subprocesses in bash, and the commands all
returned successfully almost immediately (no time-outs).
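
For illustration, probing many nodes at once from bash looks roughly like the
sketch below. This is not the exact script that was run: the hostnames in the
example call are hypothetical placeholders, and the GLUSTER_CMD variable is an
assumption added only so the command can be swapped out for a dry run.

```shell
#!/usr/bin/env bash
# Sketch: run "gluster peer probe" for many hosts in parallel,
# one background subprocess per host.
# GLUSTER_CMD is a hypothetical override (defaults to the real gluster CLI).
GLUSTER_CMD="${GLUSTER_CMD:-gluster}"

probe_all() {
    local host
    for host in "$@"; do
        # Each probe runs in the background, so all of them start at once.
        "$GLUSTER_CMD" peer probe "$host" &
    done
    # Block until every background probe command has returned.
    wait
}

# Example call (placeholder hostnames):
#   probe_all node51 node52 node53
```

Setting GLUSTER_CMD=echo prints the commands instead of running them, which is
a cheap way to check the host list before touching the cluster.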

However, glusterd has kept writing to /var/lib/glusterd/peers/ for the last 4
days, and all commands involving the newly added nodes, e.g. add-brick and
mount, time out and fail. Also, running “gluster peer status” on my nodes shows
a set of “Disconnected” nodes that varies over time.
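
To track how that number changes, the status output can be counted with a small
helper like the one below. It is only a sketch: it assumes the word
"Disconnected" appears once per affected peer in the output, as described
above, and the function name count_disconnected is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: count peers reported as Disconnected.
# Reads "gluster peer status" output on stdin and prints the number of
# lines containing the word "Disconnected" (assumed: one line per peer).
count_disconnected() {
    grep -c 'Disconnected'
}

# Example call on a node:
#   gluster peer status | count_disconnected
```

Running it periodically on a few nodes would show whether the set of
disconnected peers is shrinking or just churning.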

What should I do in this situation? Do I need to wait for the whole peer
probing process to complete, or can I simply kill glusterd and restart it?

Regards,
Yiping Peng