> Hello,
>
> I have managed to clear the pending 0031 to 0028 operation by shutting down
> all the nodes, deleting the rb_mount file and editing the rb_state file.
> However, this did not help reintroduce 00031 to the cluster (0022 as well,
> but it is offline, so there is no chance to peer probe it).
>
> I have tried to replicate node removal and reattachment on another cluster,
> and the node did seem to be accepted after peer probe, but since no spare
> servers are available for that cluster I was not able to do a "brick replace".
>
> In the gluster config files I do not find anything that might indicate
> that the node is not part of the cluster:
> * The node is part of glustervmstore-client-24
> * The subvolume is defined in replica set glustervmstore-replicate-12
> * The replica set is defined as part of the main volume.
> Everything looks like the other replica sets.
>
> *** COMMAND:
> gluster volume replace-brick glustervmstore 00031:/mnt/vmstore/brick 00028:/mnt/vmstore/brick start
> brick: 00031:/mnt/vmstore/brick does not exist in volume: glustervmstore
>
> *** Log file /var/log/glusterfs/etc-glusterfs-glusterd.vol.log extracts:
>
> On the missing node 00031:
>
> [2013-06-18 12:45:09.328647] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
> [2013-06-18 12:45:11.983650] I [glusterd-handler.c:502:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.983723] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.983793] I [glusterd-handler.c:1322:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
> [2013-06-18 12:45:11.991438] I [glusterd-handler.c:1366:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.991537] I [glusterd-handler.c:1342:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
> [2013-06-18 12:45:12.329047] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
> [2013-06-18 12:45:15.329431] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> On the node where I am attempting the brick replace from 00031 to 00028:
>
> [2013-06-18 12:45:11.982606] I [glusterd-replace-brick.c:98:glusterd_handle_replace_brick] 0-glusterd: Received replace brick req
> [2013-06-18 12:45:11.982691] I [glusterd-replace-brick.c:147:glusterd_handle_replace_brick] 0-glusterd: Received replace brick start request
> [2013-06-18 12:45:11.982754] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by 2d46fb6f-a36a-454a-b0ba-7df324746737
> [2013-06-18 12:45:11.982777] I [glusterd-handler.c:463:glusterd_op_txn_begin] 0-management: Acquired local lock
> [2013-06-18 12:45:11.984772] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: f7860586-f92c-4114-8336-823c223f18c0
> ..... LOTS of ACC messages .....
> [2013-06-18 12:45:11.987076] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: c49cfdbe-2af1-4050-bda1-bdd5fd3926b6
> [2013-06-18 12:45:11.987116] I [glusterd-rpc-ops.c:548:glusterd3_1_cluster_lock_cbk] 0-glusterd: Received ACC from uuid: 7e9e1cf3-214e-45c8-aa37-4da0def7fb6b
> [2013-06-18 12:45:11.987196] I [glusterd-utils.c:857:glusterd_volume_brickinfo_get_by_brick] 0-: brick: 00031:/mnt/vmstore/brick
> [2013-06-18 12:45:11.990732] E [glusterd-op-sm.c:1999:glusterd_op_ac_send_stage_op] 0-: Staging failed
> [2013-06-18 12:45:11.990785] I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 0 peers
> [2013-06-18 12:45:11.992356] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: f0fcb6dd-c4ef-4751-b92e-db27ffd252d4
> [2013-06-18 12:45:11.992480] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 33c008a5-9c11-44d7-95c6-58362211bbe8
> ..... LOTS of ACC messages .....
> [2013-06-18 12:45:11.994447] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 444a54c6-d4f5-4407-905c-aef4e56e02be
> [2013-06-18 12:45:11.994483] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: c49cfdbe-2af1-4050-bda1-bdd5fd3926b6
> [2013-06-18 12:45:11.994527] I [glusterd-rpc-ops.c:607:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 7e9e1cf3-214e-45c8-aa37-4da0def7fb6b
> [2013-06-18 12:45:11.994555] I [glusterd-op-sm.c:2653:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
> [2013-06-18 12:45:12.270020] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> My attempt to manually delete the affected replica sets from these files
>   /var/lib/glusterd/vols/glustervmstore
>     glustervmstore-fuse.vol
>     info
>     trusted-glustervmstore-fuse.vol
>   /var/lib/glusterd/glustershd
>     glustershd-server.vol
> failed completely, as the glusterfs service then refused to start at all,
> complaining about unknown keys.

All volfiles are autogenerated from the information in the other files under
/var/lib/glusterd/vols/<name>/ (such as ./info and ./bricks/*). So, to fix your
"situation" manually, please make sure the contents of ./info, ./node_state.info,
./rbstate and ./bricks/* are "proper" (you can either share them with me offline,
or compare them against another, healthy volume), and then issue a
"gluster volume reset <volname>" to re-write fresh volfiles. It is also a good
idea to double-check that the contents of /var/lib/glusterd/peers/* are proper.
Doing these manual steps and restarting all processes should recover you from
pretty much any situation.

Back to the cause of the problem - it appears that the ongoing replace-brick got
messed up when yet another server died. A different way of achieving what you
want is to use add-brick + remove-brick for decommissioning servers: add-brick
the new server (00028), "remove-brick start" the old one (00031), and
"remove-brick commit" once all the data has drained out. Moving forward, this
will be the recommended way to decommission servers. Use replace-brick only to
replace an already dead server (00022) with its replacement.

Let us know whether the above steps took you back to a healthy state or whether
you ran into further issues.
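For reference, that sequence might look roughly like this with the names used in
this thread. This is a sketch only: the exact add-brick/remove-brick invocations
depend on the replica layout of glustervmstore (on a replicated volume they
generally have to operate on whole replica sets), and NEWSERVER is a placeholder
for the not-yet-named replacement of 00022.

  # 0. sanity-check the state files the volfiles are generated from
  cat /var/lib/glusterd/vols/glustervmstore/info
  cat /var/lib/glusterd/vols/glustervmstore/rbstate
  ls /var/lib/glusterd/vols/glustervmstore/bricks/
  ls /var/lib/glusterd/peers/

  # 1. once the state files look correct, regenerate fresh volfiles
  gluster volume reset glustervmstore

  # 2. decommission 00031 with add-brick + remove-brick instead of replace-brick
  gluster volume add-brick glustervmstore 00028:/mnt/vmstore/brick
  gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick start
  gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick status
  gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick commit

  # 3. replace the already dead server 00022 directly with its replacement
  gluster volume replace-brick glustervmstore 00022:/mnt/vmstore/brick \
      NEWSERVER:/mnt/vmstore/brick commit force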
Avati

> I am using the Semiosis 3.3.1 packages on Ubuntu 12.04:
>
> dpkg -l | grep gluster
> rc  glusterfs         3.3.0-1                   clustered file-system
> ii  glusterfs-client  3.3.1-ubuntu1~precise8    clustered file-system (client package)
> ii  glusterfs-common  3.3.1-ubuntu1~precise8    GlusterFS common libraries and translator modules
> ii  glusterfs-server  3.3.1-ubuntu1~precise8    clustered file-system (server package)
>
> Thank you
>
> -----Original Message-----
> From: Vijay Bellur [mailto:vbellur at redhat.com]
> Sent: 18 June 2013 14:33
> To: Piliponis, Elvinas : RBB COO
> Cc: gluster-users at gluster.org
> Subject: Re: Unable to remove / replace faulty bricks
>
> On 06/18/2013 11:43 AM, elvinas.piliponis at barclays.com wrote:
> > Hello,
> >
> > When trying to recover from a failed node and replace its brick with a
> > spare one, I have trashed my cluster and now it is in a stuck state.
> >
> > Any ideas how to reintroduce/remove those nodes and bring peace and
> > order back to the cluster?
> >
> > There was a pending brick replacement operation from 0031 to 0028 (it
> > is still not committed according to the rbstate file).
> >
> > There was a hardware failure on node 0022.
> >
> > I was not able to commit the replace-brick for 0031, because 0022 was
> > not responding and not giving the cluster lock to the requesting node.
> >
> > I was not able to start the replacement of 0022 with 0028 due to the
> > pending brick replacement.
> >
> > I forced peer removal from the cluster, hoping that afterwards I would
> > be able to complete the operations. Unfortunately, I removed not only
> > 0022 but 0031 as well.
> >
> > I have peer probed 0031 successfully, and gluster volume info and
> > volume status now both list node 0031. But when I attempt a brick
> > operation I get:
> >
> > gluster volume remove-brick glustervmstore 0031:/mnt/vmstore/brick 0036:/mnt/vmstore/brick force
> > Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
> > Incorrect brick 0031:/mnt/vmstore/brick for volume glustervmstore
> >
> > gluster volume replace-brick glustervmstore 0031:/mnt/vmstore/brick 0028:/mnt/vmstore/brick commit force
> > brick: 0031:/mnt/vmstore/brick does not exist in volume: glustervmstore
>
> Looks like these commands are being rejected from a node where the volume
> information is not current. Can you please provide the glusterd logs from
> the node where these commands were issued?
>
> Thanks,
> Vijay
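For reference, one quick way to find a node whose copy of the volume information
is stale, as described above, is to run the following on every server and compare
the output. This is a sketch; the volume name and log path are the ones already
quoted in this thread.

  gluster peer status
  gluster volume info glustervmstore                   # does 0031 appear in the brick list?
  md5sum /var/lib/glusterd/vols/glustervmstore/info    # checksums should match on all nodes
  grep 0031 /var/lib/glusterd/vols/glustervmstore/info

  # glusterd log on the node where the failing command was issued
  tail -n 100 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log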