> Yes if the dataset is small, you can try rm -rf of the dir
> from the mount (assuming no other application is accessing
> them on the volume) launch heal once so that the heal info
> becomes zero and then copy it over again .

I did approximately that; the rm -rf took its sweet time, and the number of entries to be healed kept diminishing as the deletion progressed. At the end I was left with

Mon Mar 15 22:57:09 CET 2021
Gathering count of entries to be healed on volume gv0 has been successful

Brick node01:/gfs/gv0
Number of entries: 3

Brick mikrivouli:/gfs/gv0
Number of entries: 2

Brick nanosaurus:/gfs/gv0
Number of entries: 3
--------------

and that's where I've been ever since, for the past 20 hours. SHD has kept trying to heal them all along, and the log brings us back to square one:

[2021-03-16 14:51:35.059593 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 94aefa13-9828-49e5-9bac-6f70453c100f
[2021-03-16 15:39:43.680380 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.769604 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.908425 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[...]

In other words, deleting and recreating the unhealable files and directories was a workaround, but the underlying problem persists, and I can't even begin to look for it when I have no clue what errno 22 means in plain English.
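For what it's worth, errno values can be decoded locally without digging through header files; a quick sketch, assuming python3 is available on the node:

python3 -c 'import errno, os; print(errno.errorcode[22], "-", os.strerror(22))'
# prints: EINVAL - Invalid argument

So the {errno=22} in the mkdir callback is EINVAL, which matches the {error=Invalid argument} field in the same log line; the harder question of *why* mkdir gets EINVAL during entry self-heal remains open.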
In any case, glusterd.log is full of messages like

[2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0
[2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]

Every single "received heal vol req" message is immediately followed by a "dict get failed", always for server-quorum-type, for hours on end. And I begin to smell a bug.

The CLI can query the value OK:

# gluster volume get gv0 cluster.server-quorum-type
Option                                   Value
------                                   -----
cluster.server-quorum-type               off

Checking all quorum-related settings, I get

# gluster volume get gv0 all | grep quorum
cluster.quorum-type                      auto
cluster.quorum-count                     (null) (DEFAULT)
cluster.server-quorum-type               off
cluster.server-quorum-ratio              51
cluster.quorum-reads                     no (DEFAULT)
disperse.quorum-count                    0 (DEFAULT)

I never touched any of them, and none of them appear in volume info under "Options Reconfigured", so I don't know why three of them are not marked as defaults.

Next, I tried setting server-quorum-type=server. The server-quorum-type problem went away and I got a new kind of dict get failure:

The message "E [MSGID: 106061] [glusterd-volgen.c:2564:brick_graph_add_pump] 0-management: Dict get failed [{Key=enable-pump}]" repeated 2 times between [2021-03-16 17:12:18.677594 +0000] and [2021-03-16 17:12:18.779859 +0000]

I tried rolling back server-quorum-type=server and got this error:

# gluster volume set gv0 cluster.server-quorum-type off
volume set: failed: option server-quorum-type off: 'off' is not valid (possible options are none, server.)

Aha, but previously and by default it was clearly "off", not "none". That's a bug somewhere, and it is what was causing the dict get failures on server-quorum-type: the default value is one the option itself considers invalid.
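In case anyone else gets wedged in the same state: going by the CLI's own error message above, the setter only accepts "none" and "server", so the rollback presumably has to use "none" rather than "off". An untested sketch (gv0 is the volume name from this thread; I'm assuming "none" behaves like the old "off" default):

# Hypothetical rollback: the CLI rejects "off", but per its own error
# message it should accept "none" (presumably meaning disabled)
gluster volume set gv0 cluster.server-quorum-type none

# Then check what the option reads back as afterwards
gluster volume get gv0 cluster.server-quorum-type

Whether this stops the "Dict get failed" flood, and whether the option then reports "none" or silently reverts to "off", is exactly the kind of inconsistency worth attaching to a bug report.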
The missing dict key enable-pump that's required by server-quorum-type=server also looks like a bug, because there is no such setting:

# gluster volume get gv0 all | grep pump
#

There are more similarly strange complaints in the glusterd log:

[2021-03-16 17:25:43.134207 +0000] E [MSGID: 106434] [glusterd-utils.c:13379:glusterd_get_value_for_vme_entry] 0-management: xlator_volopt_dynload error (-1)
[2021-03-16 17:25:43.141816 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for localtime-logging key
[2021-03-16 17:25:43.143185 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-seckey key
[2021-03-16 17:25:43.143340 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-keyid key
[2021-03-16 17:25:43.143484 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-bucketid key
[2021-03-16 17:25:43.143621 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-hostname key

If none of this stuff is used in the first place, it should not be triggering errors and warnings. If the S3 plugin is not enabled, the S3 keys should not even be checked. Both the checking of the keys and the error logging are bugs.

Cool, I'm discovering more and more stuff that needs fixing, but I'm making zero progress with my healing problem. I'm still stuck with errno=22.

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users