Krishnan Parthasarathi <kparthas@xxxxxxxxxx> wrote:

> If you left the hung setup for over ten minutes from the time the bricks
> went down, you should see logs corresponding to one of the above two
> mechanisms in action. Let me know if you don't. Then we need to
> investigate further.

Yes, it does indeed die miserably after 10 minutes :-)

I added some debug printf() calls to see where it was hanging. Here is
the path glusterd takes when it receives "gluster volume heal info":

    gd_brick_op_phase
        glusterd_volinfo_find
        glusterd_bricks_select_heal_volume -> rxlator_count = 3
        glusterd_syncop_aggr_rsp_dict
        list_for_each_entry (pending_node, &selected, list) {
            /* first in list has rpc->conn.name = "management" */
            gd_syncop_mgmt_brick_op
                glusterd_brick_op_build_payload
                GD_SYNCOP -> never resumes
        }

It is fine by me that glusterd_bricks_select_heal_volume() finds 3
bricks: they are the 3 bricks that are still alive. However, I am
surprised to see that the first entry in the list has rpc->conn.name =
"management". It should be a brick name here, right? Or is this
glustershd?

The logs give a hint about why GD_SYNCOP never returns:

[2014-09-12 04:19:35.266126] I [socket.c:3277:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-09-12 04:19:35.266139] E [rpcsvc.c:1249:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 31) to rpc-transport (socket.management)

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@xxxxxxxxxx
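
PS: my reading of the hang, for what it is worth: GD_SYNCOP apparently
submits the brick RPC and then yields the synctask, expecting the reply
callback to wake it up; if that callback never fires, the task sleeps
forever. Below is a minimal standalone sketch of that yield/wake
pattern, using plain pthreads rather than the actual synctask
machinery. All names here (syncop_brick_op, reply_callback,
fake_reply_thread) are made up for illustration; this is NOT the
glusterd code, just the shape of the problem as I understand it.

    /* Sketch of a submit-then-yield pattern similar to what I assume
     * GD_SYNCOP does. Hypothetical names; not the glusterd code.
     * Build with: cc sketch.c -o sketch -lpthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int replied = 0;              /* set by the reply callback */

    /* Stands in for the RPC reply path. If the transport is already
     * disconnected, nothing ever calls this. */
    static void reply_callback(void)
    {
        pthread_mutex_lock(&lock);
        replied = 1;
        pthread_cond_signal(&cond);      /* the "synctask_wake" analogue */
        pthread_mutex_unlock(&lock);
    }

    /* Stands in for GD_SYNCOP: submit the request, then block until
     * the callback wakes us. With no wake, this blocks forever. */
    static void syncop_brick_op(void)
    {
        /* request submission would happen here */
        pthread_mutex_lock(&lock);
        while (!replied)
            pthread_cond_wait(&cond, &lock);  /* the "yield" */
        pthread_mutex_unlock(&lock);
        printf("brick op resumed\n");
    }

    static void *fake_reply_thread(void *arg)
    {
        (void)arg;
        sleep(1);                        /* pretend the brick answers */
        reply_callback();
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        /* Comment out the pthread_create() call to simulate a dead
         * brick whose reply never arrives: syncop_brick_op() then
         * hangs, which is what I believe happens to the glusterd
         * synctask behind GD_SYNCOP. */
        pthread_create(&t, NULL, fake_reply_thread, NULL);
        syncop_brick_op();
        pthread_join(t, NULL);
        return 0;
    }

If some failure path (a failed submit, a torn-down connection) skips
the wake, the caller hangs exactly like this, silently, which would
match what my debug printf() calls show.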