issues replacing a failed node

jjolet at drillinginfo.com (John Jolet) · Thu, 17 May 2012 01:47:19 +0000

I had a two-node distributed-replicated volume, spread across server1:/bricks/1 server2:/bricks/1 server1:/bricks/2 server2:/bricks/2.  I powered down server2 in order to re-rack it to make room for server3.  server2 failed to come back up, for reasons having nothing to do with gluster.  So I decided to go ahead and bring up server3 and move server2's bricks to it.  I saw conflicting information on how to do that with a completely dead node and a new node with a different name.  Basically I did a peer probe server3, then volume replace-brick share name server2:/bricks/1 server3:/bricks/1.  Then I did a volume replace-brick <blah> commit force.
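
As a rough sketch, the sequence described above corresponds to gluster CLI commands like the ones below. This is a dry run that only prints the commands. `sharename` is a placeholder for the real volume name, which the post does not spell out, and the `start` on the first replace-brick is an assumption inferred from the later "already running" error.

```shell
#!/bin/sh
# Dry run: prints the gluster commands rather than executing them.
# VOLNAME is a placeholder; the exact volume name is not given in the post.
VOLNAME="${VOLNAME:-sharename}"

# Print the two-step brick replacement for one brick pair.
# $1 = old brick (on the dead node), $2 = new brick.
# Assumption: the first step was issued with "start".
replace_brick_cmds() {
    echo "gluster volume replace-brick $VOLNAME $1 $2 start"
    echo "gluster volume replace-brick $VOLNAME $1 $2 commit force"
}

# Add the new node to the trusted pool, then replace the first brick.
echo "gluster peer probe server3"
replace_brick_cmds server2:/bricks/1 server3:/bricks/1
```

Calling `replace_brick_cmds server2:/bricks/2 server3:/bricks/2` prints the equivalent pair of commands for the second brick.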

This was probably a bad thing.  Then I tried to do the replace-brick with the second set.  It fails to start, saying replace-brick is already running on the volume.  Now I'm stuck.  The data in bricks/1 DOES appear on the new node, but I can't do anything with bricks/2.

If I try to do a commit, it says bricks/1 isn't on server2, and if I try to do anything else it says replace-brick is running.  I did a rebalance, hoping that would fix it, but it has not.  I attempted to stop the volume, but it said I couldn't until the replace-brick was committed or aborted.  I cannot abort; it says replace-brick abort failed.  Now what?  Mind, this is a temporary setup which has a complex directory structure, but no data as yet.  We are looking to use this for production VERY soon, and (a) I'm not sure I have time to rebuild everything, and (b), more importantly, I need to be able to demonstrate to management that "look, a node failed and we replaced it with no data loss".
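
For reference, the operations listed above map onto gluster CLI subcommands roughly as follows. This is a dry-run sketch that only prints the commands. `sharename` is a placeholder volume name, and the brick pair passed to `commit` and `abort` is an assumption, since the post does not say which pair was used for the abort.

```shell
#!/bin/sh
# Dry run: prints the commands rather than executing them.
VOLNAME="${VOLNAME:-sharename}"   # placeholder volume name
SRC="server2:/bricks/1"           # assumed source brick
DST="server3:/bricks/1"           # assumed destination brick

# Print the four operations in the order the post describes them.
attempted_cmds() {
    echo "gluster volume replace-brick $VOLNAME $SRC $DST commit"
    echo "gluster volume rebalance $VOLNAME start"
    echo "gluster volume stop $VOLNAME"
    echo "gluster volume replace-brick $VOLNAME $SRC $DST abort"
}

attempted_cmds
```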

So, what's my next step to get this mess untangled and the data safely onto my new node?