On Fri, Feb 12, 2016 at 8:54 AM, Mike Stump <mikestump@xxxxxxxxxxx> wrote:
> So, I lost one of my servers and the OS was reinstalled. The gluster data is on another disk that survives OS reinstalls. /var/lib/glusterd, however, does not.
>
> I was following the bring-it-back-up directions, but before I did that, I think a peer probe was done with the new uuid. This caused it to be dropped from the cluster entirely.
>
> I edited the uuid back to what it was, but now it is no longer in the cluster. The web site didn’t seem to have any help for how to undo the drop. It was part of a replica 2 pair, and I would like to merely have it come up and be a part of the cluster again. It has all the data (as I run with quorum and all the replica 2 pair contents are R/O until this server comes back). I don’t mind letting it refresh from the other pair member of the replica, even though the data is already on disk.
>
> I tried:
>
> # gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
> volume replace-brick: failed: Host machine04 is not in 'Peer in Cluster' state
>
> to try and let it resync into the cluster, but it won’t let me replace the brick. I can’t do:
>
> # gluster peer detach machine04
> peer detach: failed: Brick(s) with the peer machine04 exist in cluster
>
> either. What I wanted it to do is this: when it connected to the cluster the first time with the new uuid, the cluster should inform it that it might have filesystems on it (it comes in with a name already in the peer list), and it should get brick information from the cluster and check it out. If it has those, it should just notice the uuid is wrong, fix it, make it part of the cluster again, spin it all up and continue on.
>
> I tried:
>
> # gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
> volume add-brick: failed: Volume g2home does not exist
>
> and it didn’t work on either machine04 or one of the peers:
>
> # gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
> volume add-brick: failed: Operation failed
>
> So, to try and fix the Peer in Cluster issue, I stopped and restarted glusterd many times, and eventually most all resynced and came up into the Peer in Cluster state. All except for 1 that was endlessly confused. So, if the network works, it should wipe the peer, and just retry the entire state machine to get back into the right state. I had to stop the server on the two machines, manually edit the state to be 3, and then restart them. It then at least showed the right state on both.
>
> Next, let’s try and sync up the bricks:
>
> root@machine04:/# gluster volume sync machine00 all
> Sync volume may make data inaccessible while the sync is in progress. Do you want to continue? (y/n) y
> volume sync: success
> root@machine04:/# gluster vol info
> No volumes present
>
> root@machine02:/# gluster volume heal g2home full
> Staging failed on machine04. Error: Volume g2home does not exist
>
> Think about that. This is a replica 2 server; the entire point would be to fix up the array if one of the machines was screwy. heal seemed like the command to fix it up.
>
> So, now that it is connected, let’s try this again:
>
> # gluster volume replace-brick g2home machine04:/.g/g2home machine04:/.g/g2home-new commit force
> volume replace-brick: failed: Pre Validation failed on machine04. volume: g2home does not exist
>
> Nope, that won’t work. So, let’s try removing:
>
> # gluster vol remove-brick g2home replica 2 machine04:/.g/g2home machine05:/.g/g2home start
> volume remove-brick start: failed: Staging failed on machine04. Please check log file for details.
>
> Nope, that won’t work either. What’s the point of remove, if it won’t work?
>
> Ok, fine, let’s go for a bigger hammer:
>
> # gluster peer detach machine04 force
> peer detach: failed: Brick(s) with the peer machine04 exist in cluster
>
> Doh. I know that, but it is a replica!
>
> [ more googling ]
>
> Someone said to just copy the entire vols directory. [ cross fingers ] copy vols.
>
> Ok, I can now do a gluster volume status g2home detail, which I could not before. Files seem to be R/W on the array now. I think that might have worked.
>
> So, why can’t gluster copy vols by itself, if indeed that is the right thing to do?

Gluster should actually do that, provided the peer is in the 'Peer in Cluster' state.

> Why can’t the document say, just edit the state variable and just copy vols to get it going again?

Which document did you refer to? I'm not aware of a document that describes how to recover a peer after the loss of /var/lib/glusterd.

The following steps should have helped you get the cluster back into a good state quickly. On the newly reinstalled peer, before starting glusterd:

1. Create the /var/lib/glusterd/glusterd.info file and fill it with the peer's previous uuid and operating-version. The uuid can be obtained from the peerinfo files in /var/lib/glusterd/peers on the other peers. The operating-version can be obtained from glusterd.info on the other peers.
2. From one of the other peers, copy over /var/lib/glusterd/peers. Remove the peerinfo file for this peer. This should allow glusterd on this peer to accept connections from the rest of the cluster and also connect to the rest of the cluster.
3. Start glusterd.
4. The remaining information on volumes and other peers should be synced over automatically, and the bricks and other daemons should start running.

(We should probably put this down somewhere.)

>
> Why can’t probe figure out that you were already part of a cluster, and when it runs, it notices that your brains have been wiped, and just grab that info from the cluster and bring the node back up?
> It can even run heal on the data to ensure that nothing messed with it and that it matches the other replica.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users
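
Steps 1 and 2 of the recovery procedure in the reply can be sketched as two small shell functions. This is a minimal sketch under stated assumptions, not an official procedure: the function names are hypothetical, the KEY=value layout written to glusterd.info and the naming of each peerinfo file after the peer's uuid are assumptions that should be checked against a healthy peer first, and the peers directory is assumed to have been copied to the local machine already by whatever means is convenient.

```shell
#!/bin/sh
# Sketch of steps 1 and 2. Run on the reinstalled peer BEFORE starting glusterd.
# Function names are hypothetical; they are not part of gluster.

# Step 1: recreate glusterd.info with the peer's previous uuid and the
# operating-version read from glusterd.info on another peer.
# Usage: write_glusterd_info <glusterd-dir> <old-uuid> <operating-version>
write_glusterd_info() {
    gi_dir=$1; gi_uuid=$2; gi_opver=$3
    mkdir -p "$gi_dir" || return 1
    # Assumed file layout: one KEY=value pair per line.
    printf 'UUID=%s\noperating-version=%s\n' "$gi_uuid" "$gi_opver" \
        > "$gi_dir/glusterd.info"
}

# Step 2: install the peers directory copied from a healthy peer, then remove
# the peerinfo file that describes this peer itself.
# Assumption: peerinfo files are named after the uuid of the peer they describe.
# Usage: install_peers <copied-peers-dir> <glusterd-dir> <old-uuid>
install_peers() {
    ip_src=$1; ip_dir=$2; ip_uuid=$3
    mkdir -p "$ip_dir/peers" || return 1
    # Copy the contents of the source directory, not the directory itself.
    cp -R "$ip_src"/. "$ip_dir/peers"/ || return 1
    rm -f "$ip_dir/peers/$ip_uuid"
}
```

With the glusterd directory set to /var/lib/glusterd, calling write_glusterd_info and then install_peers covers steps 1 and 2. Step 3 is starting glusterd with the system's service manager, after which step 4 (the volume sync) is expected to happen on its own.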