Thank you for replying!

> Okay so 0-cm_shared-replicate-1 means these 3 bricks:
>
> Brick4: 172.23.0.6:/data/brick_cm_shared
> Brick5: 172.23.0.7:/data/brick_cm_shared
> Brick6: 172.23.0.8:/data/brick_cm_shared

The above is correct.

> Were there any pending self-heals for this volume? Is it possible that the
> server (one of Brick 4, 5 or 6) that is down had the only good copy and the
> other 2 online bricks had a bad copy (needing heal)? Clients can get EIO in
> that case.

I did check for heals and saw nothing. The storage was effectively read-only
at the time: the NFS clients mount it read-only, and there was no write
activity going to the shared storage anyway, so it was not surprising that no
heals were listed. I also inspected both remaining bricks for several of the
example problem files and found their md5sums matched.

The strange thing, as I mentioned, is that this only happened under the
job-launch workload. The NFS boot workload, which is also very stressful, ran
clean with one brick down.

> When you say accessing the file from the compute nodes afterwards works
> fine, it is still with that one server (brick) down?

I can no longer check this system personally, but as I recall, once we fixed
the ethernet problem all seemed well. I don't have a better answer than that.
I am starting a document of things to try when we next have a large system in
the factory to run on, and I'll put this in there.

> There was a case of AFR reporting spurious split-brain errors, but that was
> fixed long back (http://review.gluster.org/16362) and the fix seems to be
> present in glusterfs-4.1.6.

That is why I brought this up. In my case, we know the files really were
missing on the NFS client side because we saw errors on the clients. That is
to say, the above bug apparently caused split-brain to be reported in error
with no other impact, whereas in my case the errors resulted in actual
problems accessing the files on the NFS clients.

> Side note: Why are you using replica 9 for the ctdb volume? All
> development/tests are usually done on (distributed) replica 3 setups.

I am happy to change this. Whatever guide I used to set this up suggested
replica 9; it was so long ago that I don't even know which resource was
incorrect. I have no other reason. I'm filing an incident now to change our
setup tools to use replica 3 for CTDB on new setups.

Again, I appreciate that you followed up with me.

Thank you,
Erik
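
P.S. For anyone searching the archives later, a minimal sketch of the checks
described above, assuming the volume is named cm_shared (the brick paths
suggest it) and using a hypothetical placeholder file path:

    # list entries pending self-heal, and any reported in split-brain
    gluster volume heal cm_shared info
    gluster volume heal cm_shared info split-brain

    # on each of the two surviving brick servers, compare the copies directly
    md5sum /data/brick_cm_shared/path/to/problem/file
    # non-zero trusted.afr.* pending counters would indicate a needed heal
    getfattr -d -m . -e hex /data/brick_cm_shared/path/to/problem/file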
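
P.P.S. For the setup-tool change: on a new setup, a replica-3 CTDB lock
volume would be created with something like the following (the volume name,
hostnames, and brick paths are hypothetical):

    gluster volume create ctdb replica 3 \
        node1:/data/brick_ctdb node2:/data/brick_ctdb node3:/data/brick_ctdb
    gluster volume start ctdb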