Hi all,

I have a problem with heal: I have 995 files that fail to heal.

Gluster version: 9.3
OS: Debian Bullseye

My setup is a replicate volume with an arbiter:

Volume Name: gds-admin
Type: Replicate
Volume ID: f1f112f4-8cee-4c04-8ea5-c7d895c8c8d6
Status: Started
Snapshot Count: 8
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: urd-gds-001:/urd-gds/gds-admin
Brick2: urd-gds-002:/urd-gds/gds-admin
Brick3: urd-gds-000:/urd-gds/gds-admin (arbiter)
Options Reconfigured:
storage.build-pgfid: off
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.barrier: disable

Gluster volume status:

Status of volume: gds-admin
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick urd-gds-001:/urd-gds/gds-admin        49155     0          Y       6964
Brick urd-gds-002:/urd-gds/gds-admin        49155     0          Y       4270
Brick urd-gds-000:/urd-gds/gds-admin        49152     0          Y       1175
Self-heal Daemon on localhost               N/A       N/A        Y       7031
Self-heal Daemon on urd-gds-002             N/A       N/A        Y       4281
Self-heal Daemon on urd-gds-000             N/A       N/A        Y       1230

Task Status of Volume gds-admin
------------------------------------------------------------------------------
There are no active volume tasks

Gluster pool list:

UUID                                    Hostname        State
8823d0d9-5d02-4f47-86e9-                urd-gds-000     Connected
73139305-08f5-42c2-92b6-                urd-gds-002     Connected
d612a705-8493-474e-9fdc-                localhost       Connected

Heal info summary says:

Brick urd-gds-001:/urd-gds/gds-admin
Status: Connected
Total Number of entries: 995
Number of entries in heal pending: 995
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick urd-gds-002:/urd-gds/gds-admin
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick urd-gds-000:/urd-gds/gds-admin
Status: Connected
Total Number of entries: 995
Number of entries in heal pending: 995
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Heal statistics say (on both urd-gds-000 and urd-gds-001):

Starting time of crawl: Tue Oct 5 14:25:08 2021
Ending time of crawl: Tue Oct 5 14:25:25 2021
Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 995

To me it seems as if node urd-gds-002 has old versions of the files. I checked 2 files that still had file names: both urd-gds-000 and urd-gds-001 have the same gfid and the same timestamp for the file, while urd-gds-002 has a different gfid and an older timestamp. The client could not access the file. I manually removed the file and its gfid file from urd-gds-002, and these files were then healed.

I still have a long list (995) of entries that show up only as gfids. I tried to get the file path with (example):

getfattr -n trusted.glusterfs.pathinfo -e text /mnt/gds-admin/.gfid/4e203eb1-795e-433a-9403-753ba56575fd
getfattr: Removing leading '/' from absolute path names
# file: mnt/gds-admin/.gfid/4e203eb1-795e-433a-9403-753ba56575fd
trusted.glusterfs.pathinfo="(<REPLICATE:gds-admin-replicate-0> <POSIX(/urd-gds/gds-admin):urd-gds-000:/urd-gds/gds-admin/.glusterfs/30/70/3070276f-1096-44c8-b9e9-62625620aba3/04> <POSIX(/urd-gds/gds-admin):urd-gds-001:/urd-gds/gds-admin/.glusterfs/30/70/3070276f-1096-44c8-b9e9-62625620aba3/04>)"

This tells me that the file exists on node urd-gds-000 and urd-gds-001.
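For the two files that still had file names, the comparison and the manual fix on urd-gds-002 were along these lines (a rough sketch; the file path and the gfid below are placeholders, not real entries from the volume):

# on each brick node: compare gfid and timestamps of the named file
getfattr -n trusted.gfid -e hex /urd-gds/gds-admin/path/to/file
stat /urd-gds/gds-admin/path/to/file

# on urd-gds-002 only: remove the stale copy and its hardlink under .glusterfs
rm /urd-gds/gds-admin/path/to/file
rm /urd-gds/gds-admin/.glusterfs/ab/cd/abcdef01-2345-6789-abcd-ef0123456789

# self-heal then recreated the file from the good copies
# (an index heal can also be nudged manually)
gluster volume heal gds-admin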
I have been looking through glustershd.log and I see the same errors over and over again on urd-gds-000 and urd-gds-001:

[2021-10-05 12:46:01.095509 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1052:afr_selfheal_entry_do] 0-gds-admin-replicate-0: performing entry selfheal on d0d8b20e-c9df-4b8b-ac2e-24697fdf9201
[2021-10-05 12:46:01.802920 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:46:01.803538 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:46:01.803612 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:46:01.908395 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1052:afr_selfheal_entry_do] 0-gds-admin-replicate-0: performing entry selfheal on 0e309af2-2538-440a-8fd0-392620e83d05
[2021-10-05 12:46:01.914909 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:46:01.915225 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:46:01.915230 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]

On urd-gds-002 I get the same error over and over again:

[2021-10-05 12:34:38.013434 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:34:38.013576 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:34:38.013948 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:44:39.011771 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:44:39.011825 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:44:39.012306 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:54:40.017676 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:54:40.018240 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-10-05 12:54:40.021305 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
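For what it's worth, the gfids in the "performing entry selfheal" lines should be directories, and as far as I understand a directory gfid can be mapped back to a path directly on a brick, because its entry under .glusterfs is a symlink pointing to the parent gfid plus the directory name. A rough sketch with the gfid from the first log line above (this is just my understanding of the .glusterfs layout, so take it as an assumption):

# on one of the bricks: the symlink target shows the parent gfid and directory name
ls -l /urd-gds/gds-admin/.glusterfs/d0/d8/d0d8b20e-c9df-4b8b-ac2e-24697fdf9201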
The number of entries seems to gradually decrease; overnight it went from 995 to 972. If I do an ls from the client side in some directories, some of the file names show up in the heal info summary and then disappear after a while.

I would really appreciate some help on how to resolve this issue!

Many thanks!

Best regards
Marcus