On 21/08/20 13:56, Diego Zuccato wrote:

Hello again.

I also tried disabling bitrot (and re-enabling it afterwards), and I followed the procedure for recovery from split-brain[*] by removing the file and its link from one of the nodes, but no luck. I'm now completely out of ideas :(

How can I resync those GFIDs?

Thanks!
Diego

[*] even though "gluster volume heal BigVol info split-brain" reports 0 for every brick.

> Hello all.
>
> I have a volume set up as:
> -8<--
> root@str957-biostor:~# gluster v info BigVol
>
> Volume Name: BigVol
> Type: Distributed-Replicate
> Volume ID: c51926bd-6715-46b2-8bb3-8c915ec47e28
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 28 x (2 + 1) = 84
> Transport-type: tcp
> Bricks:
> Brick1: str957-biostor2:/srv/bricks/00/BigVol
> Brick2: str957-biostor:/srv/bricks/00/BigVol
> Brick3: str957-biostq:/srv/arbiters/00/BigVol (arbiter)
> [...]
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> client.event-threads: 8
> server.event-threads: 8
> server.ssl: on
> client.ssl: on
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
> features.bitrot: on
> features.scrub: Active
> features.scrub-freq: biweekly
> auth.ssl-allow: str957-bio*
> ssl.certificate-depth: 1
> cluster.self-heal-daemon: enable
> features.quota: on
> features.inode-quota: on
> features.quota-deem-statfs: on
> server.manage-gids: on
> features.scrub-throttle: aggressive
> -8<--
>
> After a couple of failures (a disk on biostor2 went "missing", and glusterd
> on biostq got killed by the OOM killer) I noticed that some files can't be
> accessed from the clients:
> -8<--
> $ ls -lh 1_germline_CGTACTAG_L005_R*
> -rwxr-xr-x 1 e.f domain^users 2,0G apr 24 2015 1_germline_CGTACTAG_L005_R1_001.fastq.gz
> -rwxr-xr-x 1 e.f domain^users 2,0G apr 24 2015 1_germline_CGTACTAG_L005_R2_001.fastq.gz
> $ ls -lh 1_germline_CGTACTAG_L005_R1_001.fastq.gz
> ls: cannot access '1_germline_CGTACTAG_L005_R1_001.fastq.gz': Input/output error
> -8<--
> (note that if I request ls for more files, it works...).
>
> The files have exactly the same contents (verified via md5sum). The only
> difference is in getfattr: trusted.bit-rot.version is
> 0x17000000000000005f3f9e670002ad5b on one node and
> 0x12000000000000005f3ce7af000dccad on the other.
>
> On the client, the log reports:
> -8<-
> [2020-08-21 11:32:52.208809] W [MSGID: 108008]
> [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
> 4-BigVol-replicate-13: GFID mismatch for
> <gfid:5217fe67-4dd0-47a1-8d27-143ae912ef4a>/1_germline_CGTACTAG_L005_R1_001.fastq.gz
> d70a4a6d-05fc-4988-8041-5e7f62155fe5 on BigVol-client-55 and
> f249f88a-909f-489d-8d1d-d428e842ee96 on BigVol-client-34
> [2020-08-21 11:32:52.209768] W [fuse-bridge.c:471:fuse_entry_cbk]
> 0-glusterfs-fuse: 233606: LOOKUP()
> /[...]/1_germline_CGTACTAG_L005_R1_001.fastq.gz => -1 (Input/output error)
> -8<--
>
> As suggested on IRC, I tested the RAM, but the only thing I got was a
> "Peer rejected" status due to another OOM kill. No problem, I was able
> to resolve it, but the original problem still remains.
>
> What else can I do?
>
> TIA!
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
> ________
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> https://lists.gluster.org/mailman/listinfo/gluster-users
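The client-side "GFID mismatch" warning names both conflicting GFIDs and the clients (bricks) reporting them. A minimal Python sketch, assuming the log lines are available as plain text, that extracts those values and builds the path of each GFID hardlink under a brick's `.glusterfs/<aa>/<bb>/<gfid>` directory, i.e. the "link" that the manual split-brain procedure removes together with the file. The helper names and the brick root `/srv/bricks/00/BigVol` are illustrative only: which brick actually backs `BigVol-client-55` or `BigVol-client-34` is not shown in the thread. The sketch only computes paths and changes nothing on disk.

```python
import re
from pathlib import PurePosixPath

# A GFID is a UUID in its canonical 8-4-4-4-12 hex form.
GFID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# Matches the AFR "GFID mismatch" warning once wrapped log lines are
# joined with single spaces (the format is taken from the client log quoted above).
MISMATCH_RE = re.compile(
    rf"GFID mismatch for <gfid:(?P<parent>{GFID})>/(?P<name>\S+)\s+"
    rf"(?P<gfid_a>{GFID}) on (?P<client_a>\S+) and "
    rf"(?P<gfid_b>{GFID}) on (?P<client_b>\S+)"
)


def parse_gfid_mismatch(log_text):
    """Return parent GFID, file name and {client: gfid}, or None if no match."""
    # Collapse newlines and repeated spaces so wrapped messages still match.
    m = MISMATCH_RE.search(" ".join(log_text.split()))
    if m is None:
        return None
    return {
        "parent": m["parent"],
        "name": m["name"],
        "gfids": {m["client_a"]: m["gfid_a"], m["client_b"]: m["gfid_b"]},
    }


def gfid_backend_path(brick_root, gfid):
    """Hardlink path of a GFID on a brick: <brick>/.glusterfs/<aa>/<bb>/<gfid>."""
    return str(PurePosixPath(brick_root) / ".glusterfs" / gfid[0:2] / gfid[2:4] / gfid)


if __name__ == "__main__":
    sample = """4-BigVol-replicate-13: GFID mismatch for
<gfid:5217fe67-4dd0-47a1-8d27-143ae912ef4a>/1_germline_CGTACTAG_L005_R1_001.fastq.gz
d70a4a6d-05fc-4988-8041-5e7f62155fe5 on BigVol-client-55 and
f249f88a-909f-489d-8d1d-d428e842ee96 on BigVol-client-34"""
    info = parse_gfid_mismatch(sample)
    for client, gfid in info["gfids"].items():
        # Hypothetical brick root: substitute the brick that really backs each client.
        print(client, gfid_backend_path("/srv/bricks/00/BigVol", gfid))
```

With the warning quoted above, the first GFID maps to `/srv/bricks/00/BigVol/.glusterfs/d7/0a/d70a4a6d-05fc-4988-8041-5e7f62155fe5`. Deciding which copy is the good one is still up to the administrator.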