Re: split-brain errors under heavy load when one brick down

On 16/09/19 7:34 pm, Erik Jacobson wrote:
Example errors:

ex1

[2019-09-06 18:26:42.665050] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed.
[Input/output error]

Okay, so 0-cm_shared-replicate-1 refers to these 3 bricks:

Brick4: 172.23.0.6:/data/brick_cm_shared
Brick5: 172.23.0.7:/data/brick_cm_shared
Brick6: 172.23.0.8:/data/brick_cm_shared
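
For reference, the mapping from a replicate subvolume name to brick numbers follows from how bricks are grouped: consecutive bricks in the volume's brick list form one replica set, and the subvolume index in the name is zero-based. A minimal Python sketch, assuming replica 3 for cm_shared as in the list above (the helper name is hypothetical and not part of any Gluster tooling):

```python
def bricks_for_subvolume(subvol_index, replica_count=3):
    """Return the 1-based brick numbers in a zero-based replicate subvolume.

    Assumes bricks are grouped into replica sets in brick-list order.
    """
    first = subvol_index * replica_count + 1
    return list(range(first, first + replica_count))


# cm_shared-replicate-1 is the second replica set of the volume.
print(bricks_for_subvolume(1))  # prints [4, 5, 6]
```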


ex2

[2019-09-06 18:26:55.359272] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed.
[Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199]
[nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3:
/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80,
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
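
To see which gfids and file operations are affected, messages like the two above can be pulled out of a log with a small script. This is only a sketch: the regular expression is written against the MSGID 108008 lines quoted here, the function name is hypothetical, and other Gluster versions may format the message differently.

```python
import re

# Matches AFR split-brain failures (MSGID 108008) once whitespace is collapsed.
SPLIT_BRAIN_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] E \[MSGID: 108008\] "
    r".*?(?P<subvol>\S+-replicate-\d+): Failing (?P<fop>\w+) on gfid "
    r"(?P<gfid>[0-9a-f-]{36}): split-brain observed"
)


def find_split_brain(log_text):
    """Return (timestamp, subvolume, fop, gfid) tuples found in log_text."""
    # Collapse all whitespace so entries wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    return [
        m.group("ts", "subvol", "fop", "gfid")
        for m in SPLIT_BRAIN_RE.finditer(flat)
    ]
```

Run over the two examples above, this would report one ACCESS and one READLINK failure, both on 0-cm_shared-replicate-1.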



The errors seem to happen only on the 'replicate' volume where one
server is down in the subvolume (of course, any NFS server will
trigger that when it accesses the files on the degraded volume).

Were there any pending self-heals for this volume? Is it possible that the server that is down (one of Brick 4, 5 or 6) had the only good copy, and the other 2 online bricks had bad copies (needing heal)? Clients can get EIO in that case.
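
The scenario in question can be illustrated with a toy model. This is a deliberate simplification and not how AFR is implemented (the real decision is based on per-brick changelog metadata and quorum); the class and function names are hypothetical, and which brick is down is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class BrickCopy:
    name: str
    online: bool
    good: bool  # False means this copy is marked as needing heal


def read_outcome(copies):
    """Toy model: a read needs at least one online brick with a good copy."""
    if any(c.online and c.good for c in copies):
        return "OK"
    return "EIO"


# Assumed scenario: the down brick holds the only good copy.
scenario = [
    BrickCopy("Brick4", online=False, good=True),
    BrickCopy("Brick5", online=True, good=False),
    BrickCopy("Brick6", online=True, good=False),
]
print(read_outcome(scenario))  # prints EIO
```

If either online brick had a good copy, the same model would return OK, which is why knowing the pending-heal state at the time of the errors matters.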

When you say that accessing the file from the compute nodes afterwards works fine, is that still with the one server (brick) down?

There was a case of AFR reporting spurious split-brain errors, but that was fixed long ago (http://review.gluster.org/16362), and the fix seems to be present in glusterfs-4.1.6.

Side note: Why are you using replica 9 for the ctdb volume? All development and testing is usually done on a (distributed) replica 3 setup.

Thanks,

Ravi

________

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users