Thank you so much for replying --

> > [2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]

> Since you say that the errors go away when all 3 bricks (which I guess is
> what you refer to as 'leaders') of the replica are up, it could be possible

Yes, leaders == gluster+gnfs servers here. We use 'leader' internally to mean
servers that help manage compute nodes. I try to convert it to 'server' in my
writing, but 'leader' slips out sometimes.

> that the brick you brought down had the only good copy. In such cases, even
> though you have the other 2 bricks of the replica up, they both are bad

I think all 3 copies are good, because the exact same files are accessed the
same way every time the nodes boot. With one server down, 76 nodes normally
boot with no errors. Once in a while, one fails with split-brain errors in the
log (how I check the copies is in the P.S. below). The more load I put on, the
more likely a split-brain becomes while one server is down.

That's why my test case looks so weird: it has to generate a bunch of extra
load and then try to access root filesystem files using our tools to trigger
the split-brain (a stripped-down sketch is in the P.P.S. below). The test is
good in that it produces at least a couple of split-brain errors every run.
I'm actually very happy to have a test case; we've been dealing with reports
of this problem for some time.

The healing errors seen are explained by the writable XFS image files in
gluster -- one per node -- that the nodes use for their /etc, /var, and so on.
So the 76 healing messages were expected. If it would help reduce confusion, I
can repeat the test using TMPFS for the writable areas so that the healing
list stays clear (rough idea in the P.P.P.S. below).

> copies waiting to be healed and hence all operations on those files will
> fail with EIO. Since you say this occurs under high load only. I suspect

To be clear, with one server down, operations work something like 99.9% of the
time. Same operations on every node. It's only when we bring the load up
(maybe it's heavy metadata related?) that we get split-brain errors with one
server down. It is a strange problem, but I don't believe there is a problem
with any copy of any file. Never say never, though; nothing would make me
happier than being wrong and solving the problem.

I want to thank you so much for writing back. I'm willing to try any
suggestions we come up with.

Erik
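
P.S. In case it is useful, here is roughly how I have been checking whether
gluster actually flags anything as split-brain. The volume name cm_shared
comes from the log line above; the brick path in the getfattr line is a
placeholder for wherever your bricks live:

    # List entries the self-heal daemon thinks need healing, and the
    # subset it considers split-brain, for the replica volume.
    gluster volume heal cm_shared info
    gluster volume heal cm_shared info split-brain

    # On each server, dump the AFR changelog xattrs of a suspect file
    # straight off the brick so the three copies can be compared
    # (brick path below is a placeholder).
    getfattr -d -m . -e hex /bricks/brick1/cm_shared/path/to/suspect/file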
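
P.P.S. A stripped-down sketch of what the test case does. The real tooling is
more involved; the mount point, file layout, loop counts, and log path below
are all made up for illustration:

    #!/bin/sh
    # Rough reproducer sketch: hammer the per-node files on the gluster
    # mount with parallel metadata operations, roughly the way 76 booting
    # nodes would, then look for the error in the logs. All paths and
    # counts here are placeholders.
    MNT=/mnt/cm_shared

    for node in $(seq 1 76); do
        (
            for i in $(seq 1 1000); do
                stat "$MNT/images/node$node.img" > /dev/null
                ls -l "$MNT/etc" > /dev/null
            done
        ) &
    done
    wait

    # With one server down, a run like this surfaces the error for us.
    grep "split-brain observed" /var/log/glusterfs/*.log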
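
P.P.P.S. For the TMPFS variant of the test, the idea is just to put the
writable areas on memory-backed mounts instead of the XFS image files so
nothing the nodes write lands in gluster and the heal list stays empty. On a
node it would be something along these lines (sizes are placeholders, and on
a real diskless node this would happen early in boot, with the contents
populated before services start):

    # Back the writable areas with tmpfs for the duration of the test
    # instead of the gluster-hosted XFS images.
    mount -t tmpfs -o size=512m tmpfs /etc
    mount -t tmpfs -o size=1g tmpfs /var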