How to remove a dead node and re-balance volume?

anup at corp.india.com (Anup Nair) · Thu, 5 Sep 2013 14:46:27 +0530

On Thu, Sep 5, 2013 at 12:41 AM, Vijay Bellur <vbellur at redhat.com> wrote:

> On 09/03/2013 01:18 PM, Anup Nair wrote:
>
>> Glusterfs version 3.2.2
>>
>> I have a Gluster volume in which one our of the 4 peers/nodes had
>> crashed some time ago, prior to my joining service here.
>>
>> I see from volume info that the crashed (non-existing) node is still
>> listed as one of the peers and the bricks are also listed. I would like
>> to detach this node and its bricks and rebalance the volume with
>> remaining 3 peers. But I am unable to do so. Here are my setps:
>>
>> 1. #gluster peer status
>>    Number of Peers: 3 -- (note: excluding the one I run this command from)
>>
>>    Hostname: dbstore4r294 --- (note: node/peer that is down)
>>    Uuid: 8bf13458-1222-452c-81d3-**565a563d768a
>>    State: Peer in Cluster (Disconnected)
>>
>>    Hostname: 172.16.1.90
>>    Uuid: 77ebd7e4-7960-4442-a4a4-**00c5b99a61b4
>>    State: Peer in Cluster (Connected)
>>
>>    Hostname: dbstore3r294
>>    Uuid: 23d7a18c-fe57-47a0-afbc-**1e1a5305c0eb
>>    State: Peer in Cluster (Connected)
>>
>> 2. #gluster peer detach dbstore4r294
>>    Brick(s) with the peer dbstore4r294 exist in cluster
>>
>> 3. #gluster volume info
>>
>>    Volume Name: test-volume
>>    Type: Distributed-Replicate
>>    Status: Started
>>    Number of Bricks: 4 x 2 = 8
>>    Transport-type: tcp
>>    Bricks:
>>    Brick1: dbstore1r293:/datastore1
>>    Brick2: dbstore2r293:/datastore1
>>    Brick3: dbstore3r294:/datastore1
>>    Brick4: dbstore4r294:/datastore1
>>    Brick5: dbstore1r293:/datastore2
>>    Brick6: dbstore2r293:/datastore2
>>    Brick7: dbstore3r294:/datastore2
>>    Brick8: dbstore4r294:/datastore2
>>    Options Reconfigured:
>>    network.ping-timeout: 42s
>>    performance.cache-size: 64MB
>>    performance.write-behind-**window-size: 3MB
>>    performance.io-thread-count: 8
>>    performance.cache-refresh-**timeout: 2
>>
>> Note that the non-existent node/peer is  -- dbstore4r294 (bricks are
>> :/datastore1 & /datastore2  - i.e.  brick4 and brick8)
>>
>> 4. #gluster volume remove-brick test-volume dbstore4r294:/datastore1
>>    Removing brick(s) can result in data loss. Do you want to Continue?
>> (y/n) y
>>    Remove brick incorrect brick count of 1 for replica 2
>>
>> 5. #gluster volume remove-brick test-volume dbstore4r294:/datastore1
>> dbstore4r294:/datastore2
>>    Removing brick(s) can result in data loss. Do you want to Continue?
>> (y/n) y
>>    Bricks not from same subvol for replica
>>
>> How do I remove the peer? What are the steps considering that the node
>> is non-existent?
>> */
>>
>
>
> Do you plan to replace the dead server with a new server? If so, this
> could be a possible sequence of steps:
>
>
No. We are not going to replace it. So, I need to resize it to a 3 node
cluster.

I discovered the issue when one of the node hung and I had to reboot it. I
expected Gluster volume to be available for one node failure. The volume
was non-responsive.

Surprised at that, I checked the details and found it was running with one
node missing for many months now, perhaps an year!

I have no node to replace it with. So, I am looking for a method by which I
can resize it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20130905/cd5e3e7e/attachment.html>