Ok, it appears that the following worked. Thanks for the nudge in the right direction:

    volume replace-brick test-a 10.250.4.65:/localmnt/g2lv5 10.250.4.65:/localmnt/g2lv6 commit force

then

    volume heal test-a full

and monitor the progress with

    volume heal test-a info

However, that does not solve the problem of what to do when a brick is corrupted somehow and I don't have enough spare space to first heal it and then replace it. It did get me thinking, though: "what if I replace the brick, forgo the heal, replace it again, and then do a heal?" That seems to work. So if I lose one brick, here is the process that I used to recover it:

1) Create a directory that exists just to temporarily trick gluster and allow us to maintain the correct replica count:

    mkdir /localmnt/garbage

2) Replace the dead brick with our garbage directory:

    volume replace-brick test-a 10.250.4.65:/localmnt/g2lv5 10.250.4.65:/localmnt/garbage commit force

3) Fix our dead brick using whatever process is required. In this case, for testing, we had to remove some gluster bits or it throws the "already part of a volume" error:

    setfattr -x trusted.glusterfs.volume-id /localmnt/g2lv5
    setfattr -x trusted.gfid /localmnt/g2lv5

4) Now that our dead brick is fixed, swap it back in for the garbage/temporary brick:

    volume replace-brick test-a 10.250.4.65:/localmnt/garbage 10.250.4.65:/localmnt/g2lv5 commit force

5) Now all that we have to do is let gluster heal the volume:

    volume heal test-a full

Is there anything wrong with this procedure?

Cheers,
Dave

On Fri, Aug 16, 2013 at 11:03 AM, David Gibbons <david.c.gibbons at gmail.com> wrote:

> Ravi,
>
> Thanks for the tips.
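In case it helps anyone scripting this later, the five-step swap above can be sketched as a small shell wrapper. This is only a sketch of the procedure from this thread: the host, volume name, and brick paths are the ones from my test setup, and it defaults to a dry run (printing each command) so nothing touches a live cluster unless you clear DRY_RUN.

```shell
#!/bin/sh
# Sketch of the brick-swap recovery described above.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN= to execute.
set -e

DRY_RUN=${DRY_RUN:-1}
VOL=test-a
HOST=10.250.4.65
DEAD=/localmnt/g2lv5        # the corrupted brick
TMP=/localmnt/garbage       # temporary stand-in brick

# Print each command; execute it only when DRY_RUN is empty.
run() { echo "+ $*"; [ -n "$DRY_RUN" ] || "$@"; }

# 1) temporary directory so the replica count stays correct
run mkdir -p "$TMP"

# 2) swap the dead brick out for the stand-in
run gluster volume replace-brick "$VOL" "$HOST:$DEAD" "$HOST:$TMP" commit force

# 3) repair the dead brick, then clear the xattrs that otherwise
#    trigger the "already part of a volume" error
run setfattr -x trusted.glusterfs.volume-id "$DEAD"
run setfattr -x trusted.gfid "$DEAD"

# 4) swap the repaired brick back in
run gluster volume replace-brick "$VOL" "$HOST:$TMP" "$HOST:$DEAD" commit force

# 5) trigger a full self-heal and watch progress
run gluster volume heal "$VOL" full
run gluster volume heal "$VOL" info
```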
> When I run a volume status:
>
> gluster> volume status test-a
> Status of volume: test-a
> Gluster process                                  Port   Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.250.4.63:/localmnt/g1lv2                49152  Y       8072
> Brick 10.250.4.65:/localmnt/g2lv2                49152  Y       3403
> Brick 10.250.4.63:/localmnt/g1lv3                49153  Y       8081
> Brick 10.250.4.65:/localmnt/g2lv3                49153  Y       3410
> Brick 10.250.4.63:/localmnt/g1lv4                49154  Y       8090
> Brick 10.250.4.65:/localmnt/g2lv4                49154  Y       3417
> Brick 10.250.4.63:/localmnt/g1lv5                49155  Y       8099
> Brick 10.250.4.65:/localmnt/g2lv5                N/A    N       N/A
> Brick 10.250.4.63:/localmnt/g1lv1                49156  Y       8576
> Brick 10.250.4.65:/localmnt/g2lv1                49156  Y       3431
> NFS Server on localhost                          2049   Y       3440
> Self-heal Daemon on localhost                    N/A    Y       3445
> NFS Server on 10.250.4.63                        2049   Y       8586
> Self-heal Daemon on 10.250.4.63                  N/A    Y       8593
>
> There are no active volume tasks
> --
>
> Attempting to start the volume results in:
>
> gluster> volume start test-a force
> volume start: test-a: failed: Failed to get extended attribute
> trusted.glusterfs.volume-id for brick dir /localmnt/g2lv5. Reason: No data
> available
> --
>
> It doesn't like it when I try to fire off a heal either:
>
> gluster> volume heal test-a
> Launching Heal operation on volume test-a has been unsuccessful
> --
>
> Although that did lead me to this:
>
> gluster> volume heal test-a info
> Gathering Heal info on volume test-a has been successful
>
> Brick 10.250.4.63:/localmnt/g1lv2
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv2
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv3
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv3
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv4
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv4
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv5
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv5
> Status: Brick is Not connected
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv1
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv1
> Number of entries: 0
> --
>
> So perhaps I need to re-connect the brick?
>
> Cheers,
> Dave
>
>
> On Fri, Aug 16, 2013 at 12:43 AM, Ravishankar N <ravishankar at redhat.com> wrote:
>
>> On 08/15/2013 10:05 PM, David Gibbons wrote:
>>
>> Hi There,
>>
>> I'm currently testing Gluster for possible production use. I haven't
>> been able to find the answer to this question in the forum archive or in
>> the public docs. It's possible that I don't know which keywords to search
>> for.
>>
>> Here's the question (more details below): let's say that one of my
>> bricks "fails" -- *not* a whole-node failure but a single-brick failure
>> within the node. How do I replace a single brick on a node and force a
>> sync from one of the replicas?
>>
>> I have two nodes with 5 bricks each:
>>
>> gluster> volume info test-a
>>
>> Volume Name: test-a
>> Type: Distributed-Replicate
>> Volume ID: e8957773-dd36-44ae-b80a-01e22c78a8b4
>> Status: Started
>> Number of Bricks: 5 x 2 = 10
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.250.4.63:/localmnt/g1lv2
>> Brick2: 10.250.4.65:/localmnt/g2lv2
>> Brick3: 10.250.4.63:/localmnt/g1lv3
>> Brick4: 10.250.4.65:/localmnt/g2lv3
>> Brick5: 10.250.4.63:/localmnt/g1lv4
>> Brick6: 10.250.4.65:/localmnt/g2lv4
>> Brick7: 10.250.4.63:/localmnt/g1lv5
>> Brick8: 10.250.4.65:/localmnt/g2lv5
>> Brick9: 10.250.4.63:/localmnt/g1lv1
>> Brick10: 10.250.4.65:/localmnt/g2lv1
>>
>> I formatted 10.250.4.65:/localmnt/g2lv5 (to simulate a "failure"). What
>> is the next step? I have tried various combinations of removing and
>> re-adding the brick, replacing the brick, etc. I read in a previous
>> message to this list that replace-brick was for planned changes, which
>> makes sense, so that's probably not my next step.
>>
>> You must first check whether the "formatted" brick
>> 10.250.4.65:/localmnt/g2lv5 is online using the `gluster volume status`
>> command. If it is not, start the volume using
>> `gluster volume start <VOLNAME> force`. You can then use the gluster
>> volume heal command, which will copy the data from the other replica
>> brick into your formatted brick.
>> Hope this helps.
>> -Ravi
>>
>> Cheers,
>> Dave
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
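For the simpler failure mode Ravi describes (the brick process is merely offline, with data intact), the recovery reduces to three commands plus monitoring. A sketch as a printed checklist, using the volume name from this thread; it only echoes the commands, so it is safe to run and review before pasting into the gluster CLI:

```shell
#!/bin/sh
# Ravi's recovery path, printed as a checklist rather than executed.
VOL=test-a

CMDS=$(cat <<EOF
gluster volume status $VOL        # confirm the formatted brick shows offline
gluster volume start $VOL force   # if so, force-start to bring it back online
gluster volume heal $VOL          # copy data in from the surviving replica
gluster volume heal $VOL info     # monitor the heal
EOF
)
echo "$CMDS"
```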