On Mon, Dec 03, 2012 at 01:44:47PM +0000, Brian Candler wrote:
> So this all looks broken, and as I can't find any gluster documentation
> saying what these various states mean, I'm not sure how to proceed. Any
> suggestions?

Update. On storage1 and storage3 I killed all glusterfs(d) processes, did

    rm /var/lib/glusterd/peers/*
    rm -rf /var/lib/glusterd/vols/*

and restarted glusterd. Then I did "gluster peer probe storage2". On the
first attempt I was getting

    State: Accepted peer request (Connected)

and I couldn't work out why it didn't move to a fully connected peer. But
after a detach and probe again, from storage3 I got

    State: Peer in Cluster (Connected)

which suggests it is OK.

However, "gluster volume info" on both shows that I have lost the volume I
had on storage3. Trying to recreate it:

    # gluster volume create scratch3 storage3:/disk/scratch/scratch3
    /disk/scratch/scratch3 or a prefix of it is already part of a volume

Now I do remember seeing something about a script to remove xattrs, but I
can't find it in the ubuntu glusterfs-{server,common,client,examples}
packages. Back to the mailing list archives:

http://www.mail-archive.com/gluster-users at gluster.org/msg09013.html

So I did the two setfattr commands and was able to recreate the volume
without loss of data.

storage1 was a bit more awkward:

    root@storage1:/var/lib/glusterd# gluster peer status
    No peers present
    root@storage1:/var/lib/glusterd# gluster peer probe storage2
    storage2 is already part of another cluster

<<Digs around source code>>
<<./xlators/mgmt/glusterd/src/glusterd-handler.c>>

OK, because storage2 already has a peer, it looks like I have to probe
storage1 from storage2, not the other way round. It works this time.

So I think it's all working again now, but for someone not prepared to
experiment and get their hands dirty, it would have been a very hairy
experience. I have to say that, in my opinion, the two worst aspects of
glusterfs by far are:

- lack of error reporting, other than grubbing through log files on both
  the client and the server
- lack of documentation (especially recovery procedures for things like
  failed bricks, replacing bricks, volume info out of sync, and
  split-brain data out of sync)

Unfortunately, live systems are not where you want to be experimenting :-(

Regards,

Brian.
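
P.S. For anyone who hits the same "or a prefix of it is already part of a
volume" error: I haven't pasted the exact commands from that thread, but
from memory the two setfattr calls are along these lines (the brick path
here is my scratch3 brick; check the linked post before running them on
your own bricks):

    # remove the stale volume-membership xattrs left on the old brick
    setfattr -x trusted.glusterfs.volume-id /disk/scratch/scratch3
    setfattr -x trusted.gfid /disk/scratch/scratch3

After that, "gluster volume create" accepted the brick again and the data
underneath was untouched.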