On 06/18/2013 11:43 AM, elvinas.piliponis at barclays.com wrote:
> Hello,
>
> While trying to recover from a failed node and replace a brick with a spare
> one, I have trashed my cluster and it is now stuck.
>
> Any ideas how to reintroduce/remove those nodes and bring peace and
> order back to the cluster?
>
> There was a pending replace-brick operation from 0031 to 0028 (it is
> still not committed according to the rbstate file).
>
> There was a hardware failure on node 0022.
>
> I was not able to commit the replace-brick from 0031 because 0022 was not
> responding and not releasing the cluster lock to the requesting node.
>
> I was not able to start a replacement from 0022 to 0028 because of the
> pending replace-brick.
>
> I forced peer removal from the cluster, hoping that afterwards I would
> be able to complete the operations. Unfortunately I removed not only
> 0022 but 0031 as well.
>
> I have peer probed 0031 successfully, and gluster volume info and
> volume status both now list the 0031 node. But when I attempt a brick
> operation I get:
>
> gluster volume remove-brick glustervmstore 0031:/mnt/vmstore/brick 0036:/mnt/vmstore/brick force
>
> Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
>
> Incorrect brick 0031:/mnt/vmstore/brick for volume glustervmstore
>
> gluster volume replace-brick glustervmstore 0031:/mnt/vmstore/brick 0028:/mnt/vmstore/brick commit force
>
> brick: 0031:/mnt/vmstore/brick does not exist in volume: glustervmstore

Looks like these commands are being rejected by a node whose volume
information is not current. Can you please provide the glusterd logs from the
node where these commands were issued?

Thanks,
Vijay
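
For reference, one way to check whether the rejecting node's view of the
volume is stale is to compare glusterd's on-disk configuration with that of a
healthy peer, and to grab the glusterd log requested above. The commands
below are only a sketch, assuming default paths (/var/lib/glusterd,
/var/log/glusterfs) and a placeholder peer name "good-peer"; adjust them to
your deployment and GlusterFS version.

# On the node where remove-brick/replace-brick was rejected, and on a
# known-good peer, compare the volume configuration glusterd has on disk:
cat /var/lib/glusterd/vols/glustervmstore/info
ls /var/lib/glusterd/vols/glustervmstore/bricks/

# Confirm all peers are connected from the rejecting node's point of view:
gluster peer status

# Default location of the glusterd log asked for above:
less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# If the rejecting node's configuration really is stale, volume sync can
# pull the configuration from a peer that still has the correct view
# ("good-peer" is a placeholder hostname):
gluster volume sync good-peer glustervmstore

If the on-disk brick list on the rejecting node is missing 0031, re-issuing
the brick operations from a peer whose configuration is current, or syncing
the configuration as above, may be enough to get unstuck.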