On 29/03/20 9:40 am, Erik Jacobson wrote:
> Hello all,
> I am getting split-brain errors in the gnfs nfs.log when 1 gluster
> server is down in a 3-brick/3-node gluster volume. It only happens under
> intense load.
> In the lab, I have a test case that can repeat the problem on a single
> subvolume cluster.
> If all leaders are up, we see no errors.
> Here are example nfs.log errors:
> [2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
Since you say the errors go away when all 3 bricks of the replica (which
I guess is what you refer to as 'leaders') are up, it is possible that
the brick you brought down had the only good copy. In that case, even
though the other 2 bricks of the replica are up, both of them are bad
copies waiting to be healed, so all operations on those files will fail
with EIO. Since you say this occurs only under high load, I suspect that
is what is happening here: the self-heal hasn't had time to catch up
with the nodes going up and down.
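You can check whether heals are pending (and whether they are keeping
up) with the heal info commands. Assuming your volume name is cm_shared,
as in your log line:

  gluster volume heal cm_shared info
  gluster volume heal cm_shared statistics heal-count

If the pending-heal counts stay high while bricks go up and down under
load, that would support the theory that self-heal simply hasn't caught
up.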
If you see the split-brain errors despite all 3 replica bricks being
online and the gnfs server being able to connect to all of them, then it
could be a genuine split-brain problem. But I don't think that is the
case here.
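To rule that out, you could also run the following with all 3 bricks up:

  gluster volume heal cm_shared info split-brain

If it lists no entries, the EIO errors are coming from files whose only
good copy was on the downed brick rather than from a genuine
split-brain.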
Regards,
Ravi