Well, it looks like I've stumped the list, so I did a bit of additional
digging myself:

azathoth replicates with yog-sothoth, so I compared their brick
directories. `ls -R /var/local/brick0/data | md5sum` gives the same
result on both servers, so the filenames are identical in both bricks.
However, `du -s /var/local/brick0/data` shows that azathoth has about 3G
more data (445G vs 442G) than yog.

This seems consistent with my assumption that the problem is on
yog-sothoth (everything is fine with only azathoth; there are problems
with only yog-sothoth) and I am reminded that a few weeks ago,
yog-sothoth was offline for 4-5 days, although it should have been
brought back up-to-date once it came back online.

So, assuming that the issue is stale/missing data on yog-sothoth, is
there a way to force gluster to do a full refresh of the data from
azathoth's brick to yog-sothoth's brick? I would have expected running
heal and/or rebalance to do that sort of thing, but I've run them both
(with and without fix-layout on the rebalance) and the problem persists.

If there isn't a way to force a refresh, how risky would it be to kill
gluster on yog-sothoth, wipe everything from /var/local/brick0, and then
re-add it to the cluster as if I were replacing a physically failed
disk? Seems like that should work in principle, but it feels dangerous
to wipe the partition and rebuild, regardless.

On Tue, Feb 13, 2018 at 07:33:44AM -0600, Dave Sherohman wrote:
> I'm using gluster for a virt-store with 3x2 distributed/replicated
> servers for 16 qemu/kvm/libvirt virtual machines using image files
> stored in gluster and accessed via libgfapi. Eight of these disk images
> are standalone, while the other eight are qcow2 images which all share a
> single backing file.
>
> For the most part, this is all working very well. However, one of the
> gluster servers (azathoth) causes three of the standalone VMs and all 8
> of the shared-backing-image VMs to fail if it goes down.
> Any of the other gluster servers can go down with no problems; only
> azathoth causes issues.
>
> In addition, the kvm hosts have the gluster volume fuse mounted and one
> of them (out of five) detects an error on the gluster volume and puts
> the fuse mount into read-only mode if azathoth goes down. libgfapi
> connections to the VM images continue to work normally from this host
> despite this and the other four kvm hosts are unaffected.
>
> It initially seemed relevant that I have the libgfapi URIs specified as
> gluster://azathoth/..., but I've tried changing them to make the initial
> connection via other gluster hosts and it had no effect on the problem.
> Losing azathoth still took them out.
>
> In addition to changing the mount URI, I've also manually run a heal and
> rebalance on the volume, enabled the bitrot daemons (then turned them
> back off a week later, since they reported no activity in that time),
> and copied one of the standalone images to a new file in case it was a
> problem with the file itself. As far as I can tell, none of these
> attempts changed anything.
>
> So I'm at a loss. Is this a known type of problem? If so, how do I fix
> it? If not, what's the next step to troubleshoot it?
>
> # gluster --version
> glusterfs 3.8.8 built on Jan 11 2017 14:07:11
> Repository revision: git://git.gluster.com/glusterfs.git
>
> # gluster volume status
> Status of volume: palantir
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick saruman:/var/local/brick0/data        49154     0          Y       10690
> Brick gandalf:/var/local/brick0/data        49155     0          Y       18732
> Brick azathoth:/var/local/brick0/data       49155     0          Y       9507
> Brick yog-sothoth:/var/local/brick0/data    49153     0          Y       39559
> Brick cthulhu:/var/local/brick0/data        49152     0          Y       2682
> Brick mordiggian:/var/local/brick0/data     49152     0          Y       39479
> Self-heal Daemon on localhost               N/A       N/A        Y       9614
> Self-heal Daemon on saruman.lub.lu.se       N/A       N/A        Y       15016
> Self-heal Daemon on cthulhu.lub.lu.se       N/A       N/A        Y       9756
> Self-heal Daemon on gandalf.lub.lu.se       N/A       N/A        Y       5962
> Self-heal Daemon on mordiggian.lub.lu.se    N/A       N/A        Y       8295
> Self-heal Daemon on yog-sothoth.lub.lu.se   N/A       N/A        Y       7588
>
> Task Status of Volume palantir
> ------------------------------------------------------------------------------
> Task                 : Rebalance
> ID                   : c38e11fe-fe1b-464d-b9f5-1398441cc229
> Status               : completed
>
>
> --
> Dave Sherohman
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-users

--
Dave Sherohman