I'd just like to make an update according to my latest findings on this. Googling further, I ended up reading this article:

https://community.rackspace.com/developers/f/7/t/4858

Checking it against the docs (https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/) and my situation, I was able to establish a reproducible chain of events, like this:

# stop glusterfs
sst2# service glusterfs-server stop
sst2# killall glusterfs glusterfsd

# make sure there are no more glusterfs processes
sst2# ps auwwx | grep gluster

# preserve glusterd.info and clean everything else
sst2# cd /var/lib/glusterd && mv glusterd.info .. && rm -rf * && mv ../glusterd.info .

# start glusterfs
sst2# service glusterfs-server start

# probe peers
sst2# gluster peer status
Number of Peers: 0

sst2# gluster peer probe sst0
peer probe: success.
sst2# gluster peer probe sst1
peer probe: success.

# restart glusterd twice to bring peers back into the cluster
sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Accepted peer request (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Accepted peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Sent and Received peer request (Connected)

sst2# service glusterfs-server restart
sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Peer in Cluster (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer in Cluster (Connected)
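In case someone wants to script this: the double restart really just nudges the peer handshake forward one state at a time, so instead of restarting a fixed number of times, a small retry loop can wait until every peer reports "Peer in Cluster". Just a rough sketch, assuming the same Debian-style init script as above and English CLI output; a real script would also want a retry cap:

    # restart glusterd until every peer reports "Peer in Cluster"
    while gluster peer status | grep '^State:' | grep -qv 'Peer in Cluster'; do
        service glusterfs-server restart
        sleep 5
    done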
# resync volume information
sst2# gluster volume sync sst0 all
Sync volume may make data inaccessible while the sync is in progress. Do you want to continue? (y/n) y
volume sync: success

sst2# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
storage.owner-uid: 1000
storage.owner-gid: 1000

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49153     0          Y       29830
Brick sst2:/var/glusterfs                   49152     0          Y       5137
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       6034
NFS Server on sst0                          N/A       N/A        N       N/A
Self-heal Daemon on sst0                    N/A       N/A        Y       29821
NFS Server on sst1                          N/A       N/A        N       N/A
Self-heal Daemon on sst1                    N/A       N/A        Y       19997

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster volume heal gv0 full
Launching heal operation to perform full self heal on volume gv0 has been successful
Use heal info commands to check status

sst2# gluster volume heal gv0 info
Brick sst0:/var/glusterfs
Status: Connected
Number of entries: 0

Brick sst2:/var/glusterfs
Status: Connected
Number of entries: 0

The most disturbing thing about this is that I'm perfectly sure the bricks are NOT in sync, according to the du -s output:

sst0# du -s /var/glusterfs/
3107570500      /var/glusterfs/

sst2# du -s /var/glusterfs/
3107567396      /var/glusterfs/
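For reference, this is how I'm trying to pin down the difference at the file level. Just a sketch, assuming root SSH access from sst2 to sst0; I exclude .glusterfs on purpose, since gluster's internal bookkeeping there can legitimately differ between bricks and skew the raw du numbers:

    # dry run (-n): list files that differ between the two bricks without changing anything
    sst2# rsync -anv --delete --exclude=.glusterfs root@sst0:/var/glusterfs/ /var/glusterfs/

Since -n makes it a dry run, it only lists what would change and is safe to run against live bricks; any paths it prints are candidates for files the self-heal daemon missed.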
If anybody could be so kind as to point out how to get the replicas back in sync, I would be extremely grateful.

Best,
Seva


28.04.2017, 13:01, "Seva Gluschenko" <gvs@xxxxxxxxxxxxx>:
> Of course. Please find attached. Hope they can shed some light on this.
>
> Thanks,
>
> Seva
>
> 28.04.2017, 12:41, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>:
>> Can you share the glusterd logs from the three nodes?
>>
>> Rafi KC
>>
>> On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
>>> Dear Community,
>>>
>>> I call for your wisdom, as it appears that googling for keywords doesn't help much.
>>>
>>> I have a glusterfs volume with replica count 2, and I tried to perform the online upgrade procedure described in the docs (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). It all went almost fine once I was done with the first replica; the only problem was the self-heal procedure, which refused to complete until I commented out all IPv6 entries in /etc/hosts.
>>>
>>> Being sure that it should all work on the 2nd replica pretty much the same as on the 1st one, I proceeded with the upgrade on replica 2. All of a sudden, it told me that it didn't see the first replica at all. The state before the upgrade was:
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst0:/var/glusterfs                   49152     0          Y       3482
>>> Brick sst2:/var/glusterfs                   49152     0          Y       29863
>>> NFS Server on localhost                     2049      0          Y       25175
>>> Self-heal Daemon on localhost               N/A       N/A        Y       25283
>>> NFS Server on sst0                          N/A       N/A        N       N/A
>>> Self-heal Daemon on sst0                    N/A       N/A        Y       4827
>>> NFS Server on sst1                          N/A       N/A        N       N/A
>>> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Peer in Cluster (Connected)
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> sst2# gluster volume heal gv0 info
>>> Brick sst0:/var/glusterfs
>>> Number of entries: 0
>>>
>>> Brick sst2:/var/glusterfs
>>> Number of entries: 0
>>>
>>> After the upgrade, it looked like this:
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Peer Rejected (Connected)
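(A side note from my later digging, in case it helps someone: "Peer Rejected" apparently means that the volume configuration checksums kept by glusterd disagree between the nodes. A quick way to confirm this before wiping anything, assuming the standard /var/lib/glusterd layout, is to compare the per-volume cksum files:

    # the values should be identical on healthy peers
    sst0# cat /var/lib/glusterd/vols/gv0/cksum
    sst2# cat /var/lib/glusterd/vols/gv0/cksum

If they differ, the rejected node holds a stale or diverged volume definition, which matches what happened below.)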
>>> This was probably my biggest fault: at that point I googled, found this article, https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/ , and followed its advice, removing on sst2 all the /var/lib/glusterd contents except the glusterd.info file. As a result, the node predictably lost all information about the volume.
>>>
>>> sst2# gluster volume status
>>> No volumes present
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Accepted peer request (Connected)
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Accepted peer request (Connected)
>>>
>>> Okay, I thought, this might be high time to re-add the brick. Not that easy, Jack:
>>>
>>> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
>>> volume add-brick: failed: Operation failed
>>>
>>> The reason appeared to be natural: sst0 still knew that there was a replica on sst2. What should I do then? At this point, I tried to recover the volume information on sst2 by taking it offline and copying all the volume info from sst0. Of course, it wasn't enough to just copy it as is; I modified /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for the remote brick (sst0) and listen-port=49152 for the local brick (sst2). It didn't help much, unfortunately. The final state I've reached is as follows:
>>>
>>> sst2# gluster peer status
>>> Number of Peers: 2
>>>
>>> Hostname: sst1
>>> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
>>> State: Sent and Received peer request (Connected)
>>>
>>> Hostname: sst0
>>> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
>>> State: Sent and Received peer request (Connected)
>>>
>>> sst2# gluster volume info
>>>
>>> Volume Name: gv0
>>> Type: Replicate
>>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: sst0:/var/glusterfs
>>> Brick2: sst2:/var/glusterfs
>>> Options Reconfigured:
>>> cluster.self-heal-daemon: enable
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 1000
>>> storage.owner-gid: 1000
>>>
>>> sst2# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Meanwhile, on sst0:
>>>
>>> sst0# gluster volume info
>>>
>>> Volume Name: gv0
>>> Type: Replicate
>>> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: sst0:/var/glusterfs
>>> Brick2: sst2:/var/glusterfs
>>> Options Reconfigured:
>>> storage.owner-gid: 1000
>>> storage.owner-uid: 1000
>>> performance.readdir-ahead: on
>>> cluster.self-heal-daemon: enable
>>>
>>> sst0# gluster volume status
>>> Status of volume: gv0
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick sst0:/var/glusterfs                   49152     0          Y       31263
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost               N/A       N/A        Y       31254
>>>
>>> Task Status of Volume gv0
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Any ideas on how to bring sst2 back to normal are appreciated. As a last resort, I can schedule downtime, back up the data, kill the volume, and start all over, but I would like to know if there is a shorter path. Thank you very much in advance.
>>>
>>> --
>>> Best Regards,
>>>
>>> Seva Gluschenko

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users