Ok, it appears that the following worked. Thanks for the nudge in the right direction:

    volume replace-brick test-a 10.250.4.65:/localmnt/g2lv5 10.250.4.65:/localmnt/g2lv6 commit force

then

    volume heal test-a full

and monitor the progress with

    volume heal test-a info

However, that does not solve the problem of what to do when a brick is corrupted somehow and I don't have enough spare space to first heal it and then replace it. It did get me thinking, though: "what if I replace the brick, forgo the heal, replace it again, and then do a heal?" That seems to work. So if I lose one brick, here is the process that I used to recover it:

1) Create a directory that exists just to temporarily trick gluster and allow us to maintain the correct replica count:

    mkdir /localmnt/garbage

2) Replace the dead brick with our garbage directory:

    volume replace-brick test-a 10.250.4.65:/localmnt/g2lv5 10.250.4.65:/localmnt/garbage commit force

3) Fix our dead brick using whatever process is required. In this case, for testing, we had to remove some gluster bits or it throws the "already part of a volume" error:

    setfattr -x trusted.glusterfs.volume-id /localmnt/g2lv5
    setfattr -x trusted.gfid /localmnt/g2lv5

4) Now that our dead brick is fixed, swap it back in for the garbage/temporary brick:

    volume replace-brick test-a 10.250.4.65:/localmnt/garbage 10.250.4.65:/localmnt/g2lv5 commit force

5) Now all that we have to do is let gluster heal the volume:

    volume heal test-a full

Is there anything wrong with this procedure?

Cheers,
Dave

On Fri, Aug 16, 2013 at 11:03 AM, David Gibbons <david.c.gibbons at gmail.com> wrote:

> Ravi,
>
> Thanks for the tips.
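In case it helps anyone scripting this later, the five-step swap above can be sketched as a small shell wrapper. This is only a sketch of the procedure from this thread: the host, volume name, and brick paths are the ones from my test setup, and it defaults to a dry run (printing each command) so nothing touches a live cluster unless you clear DRY_RUN.

```shell
#!/bin/sh
# Sketch of the brick-swap recovery described above.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN= to execute.
set -e

DRY_RUN=${DRY_RUN:-1}
VOL=test-a
HOST=10.250.4.65
DEAD=/localmnt/g2lv5        # the corrupted brick
TMP=/localmnt/garbage       # temporary stand-in brick

# Print each command; execute it only when DRY_RUN is empty.
run() { echo "+ $*"; [ -n "$DRY_RUN" ] || "$@"; }

# 1) temporary directory so the replica count stays correct
run mkdir -p "$TMP"

# 2) swap the dead brick out for the stand-in
run gluster volume replace-brick "$VOL" "$HOST:$DEAD" "$HOST:$TMP" commit force

# 3) repair the dead brick, then clear the xattrs that otherwise
#    trigger the "already part of a volume" error
run setfattr -x trusted.glusterfs.volume-id "$DEAD"
run setfattr -x trusted.gfid "$DEAD"

# 4) swap the repaired brick back in
run gluster volume replace-brick "$VOL" "$HOST:$TMP" "$HOST:$DEAD" commit force

# 5) trigger a full self-heal and watch progress
run gluster volume heal "$VOL" full
run gluster volume heal "$VOL" info
```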
> When I run a volume status:
>
> gluster> volume status test-a
> Status of volume: test-a
> Gluster process                                  Port   Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.250.4.63:/localmnt/g1lv2                49152  Y       8072
> Brick 10.250.4.65:/localmnt/g2lv2                49152  Y       3403
> Brick 10.250.4.63:/localmnt/g1lv3                49153  Y       8081
> Brick 10.250.4.65:/localmnt/g2lv3                49153  Y       3410
> Brick 10.250.4.63:/localmnt/g1lv4                49154  Y       8090
> Brick 10.250.4.65:/localmnt/g2lv4                49154  Y       3417
> Brick 10.250.4.63:/localmnt/g1lv5                49155  Y       8099
> Brick 10.250.4.65:/localmnt/g2lv5                N/A    N       N/A
> Brick 10.250.4.63:/localmnt/g1lv1                49156  Y       8576
> Brick 10.250.4.65:/localmnt/g2lv1                49156  Y       3431
> NFS Server on localhost                          2049   Y       3440
> Self-heal Daemon on localhost                    N/A    Y       3445
> NFS Server on 10.250.4.63                        2049   Y       8586
> Self-heal Daemon on 10.250.4.63                  N/A    Y       8593
>
> There are no active volume tasks
> --
>
> Attempting to start the volume results in:
>
> gluster> volume start test-a force
> volume start: test-a: failed: Failed to get extended attribute
> trusted.glusterfs.volume-id for brick dir /localmnt/g2lv5. Reason: No data
> available
> --
>
> It doesn't like it when I try to fire off a heal either:
>
> gluster> volume heal test-a
> Launching Heal operation on volume test-a has been unsuccessful
> --
>
> Although that did lead me to this:
>
> gluster> volume heal test-a info
> Gathering Heal info on volume test-a has been successful
>
> Brick 10.250.4.63:/localmnt/g1lv2
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv2
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv3
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv3
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv4
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv4
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv5
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv5
> Status: Brick is Not connected
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv1
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv1
> Number of entries: 0
> --
>
> So perhaps I need to re-connect the brick?
>
> Cheers,
> Dave
>
>
> On Fri, Aug 16, 2013 at 12:43 AM, Ravishankar N <ravishankar at redhat.com> wrote:
>
>> On 08/15/2013 10:05 PM, David Gibbons wrote:
>>
>> Hi There,
>>
>> I'm currently testing Gluster for possible production use. I haven't
>> been able to find the answer to this question in the forum archive or in
>> the public docs. It's possible that I don't know which keywords to search
>> for.
>>
>> Here's the question (more details below): let's say that one of my
>> bricks "fails" -- *not* a whole-node failure but a single-brick failure
>> within the node. How do I replace a single brick on a node and force a
>> sync from one of the replicas?
>>
>> I have two nodes with 5 bricks each:
>>
>> gluster> volume info test-a
>>
>> Volume Name: test-a
>> Type: Distributed-Replicate
>> Volume ID: e8957773-dd36-44ae-b80a-01e22c78a8b4
>> Status: Started
>> Number of Bricks: 5 x 2 = 10
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.250.4.63:/localmnt/g1lv2
>> Brick2: 10.250.4.65:/localmnt/g2lv2
>> Brick3: 10.250.4.63:/localmnt/g1lv3
>> Brick4: 10.250.4.65:/localmnt/g2lv3
>> Brick5: 10.250.4.63:/localmnt/g1lv4
>> Brick6: 10.250.4.65:/localmnt/g2lv4
>> Brick7: 10.250.4.63:/localmnt/g1lv5
>> Brick8: 10.250.4.65:/localmnt/g2lv5
>> Brick9: 10.250.4.63:/localmnt/g1lv1
>> Brick10: 10.250.4.65:/localmnt/g2lv1
>>
>> I formatted 10.250.4.65:/localmnt/g2lv5 (to simulate a "failure"). What
>> is the next step? I have tried various combinations of removing and
>> re-adding the brick, replacing the brick, etc. I read in a previous
>> message to this list that replace-brick was for planned changes, which
>> makes sense, so that's probably not my next step.
>>
>> You must first check whether the "formatted" brick
>> 10.250.4.65:/localmnt/g2lv5 is online using the `gluster volume status`
>> command. If it is not, start the volume using
>> `gluster volume start <VOLNAME> force`. You can then use the gluster
>> volume heal command, which will copy the data from the other replica
>> brick into your formatted brick.
>> Hope this helps.
>> -Ravi
>>
>> Cheers,
>> Dave
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
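For the simpler failure mode Ravi describes (the brick process is merely offline, with data intact), the recovery reduces to three commands plus monitoring. A sketch as a printed checklist, using the volume name from this thread; it only echoes the commands, so it is safe to run and review before pasting into the gluster CLI:

```shell
#!/bin/sh
# Ravi's recovery path, printed as a checklist rather than executed.
VOL=test-a

CMDS=$(cat <<EOF
gluster volume status $VOL        # confirm the formatted brick shows offline
gluster volume start $VOL force   # if so, force-start to bring it back online
gluster volume heal $VOL          # copy data in from the surviving replica
gluster volume heal $VOL info     # monitor the heal
EOF
)
echo "$CMDS"
```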