Dispersed Volume Errors after failed expansion

edrock200 <edrock200@xxxxxxxxxxxxxx> · Fri, 10 Nov 2023 18:37:18 +0000

Hello,
I've run into an issue with Gluster 11.1 and need some assistance. I have a 4+1 dispersed gluster setup consisting of 20 nodes and 200 bricks. This setup was 15 nodes and 150 bricks until last week and was working flawlessly. We needed more space so we expanded the volume by adding 5 more nodes and 50 bricks.
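
For reference, the geometry above works out as follows. This is only arithmetic on the numbers already given. The one assumption, flagged in the code, is that disperse subvolumes are numbered from 0 in brick order with newly added bricks appended at the end, which would make the subvolumes from the expansion the highest-numbered ones.

```python
# Sketch of the volume geometry described above (4 data + 1 redundancy).
DATA = 4          # data fragments per disperse subvolume
REDUNDANCY = 1    # redundancy fragments per disperse subvolume
BRICKS_PER_SUBVOL = DATA + REDUNDANCY


def subvolume_count(total_bricks: int) -> int:
    """Number of disperse subvolumes for a given brick count."""
    if total_bricks % BRICKS_PER_SUBVOL:
        raise ValueError("brick count must be a multiple of 5 for 4+1")
    return total_bricks // BRICKS_PER_SUBVOL


old_subvols = subvolume_count(150)   # 30
new_subvols = subvolume_count(200)   # 40

# ASSUMPTION (not confirmed in this post): subvolumes are numbered from 0
# in brick-list order and add-brick appends to the end of that list, so
# the subvolumes created by the expansion are the highest-numbered ones.
added = range(old_subvols, new_subvols)


def is_new_subvolume(index: int) -> bool:
    """True if the subvolume index falls in the range added by the expansion."""
    return index in added


# Subvolume indices that show up in the client log excerpts.
for idx in (4, 30, 36, 39):
    kind = "new" if is_new_subvolume(idx) else "original"
    print(f"media-disperse-{idx}: {kind}")
```

If that numbering assumption holds, media-disperse-30, -36 and -39 in the client logs further down would belong to the new bricks, while media-disperse-4 would be one of the original subvolumes.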

We added the nodes and triggered a fix-layout command. Unknown to us at the time, one of the five new nodes had a hardware issue: its CPU cooling fan was bad. This caused the node to throttle down to 500 MHz on all cores and eventually shut itself down mid fix-layout. Due to how our ISP works, we could only replace the entire node, so we did, and then executed a replace-brick command.

This is the state we are in now, and I'm not sure how best to proceed to fix the errors and behavior I'm seeing. I'm not sure whether running fix-layout again should be the next step, given that hundreds of objects are stuck in a persistent heal state and that just about any command other than status, info, or heal volume info results in all client mounts hanging for ~5 minutes or bricks starting to drop. The client logs show numerous anomalies as well, such as:

[2023-11-10 17:41:52.153423 +0000] W [MSGID: 122040] [ec-common.c:1262:ec_prepare_update_cbk] 0-media-disperse-30: Failed to get size and version :  FOP : 'XATTROP' failed on '/path/to/folder' with gfid 0d295c94-5577-4445-9e57-6258f24d22c5. Parent FOP: OPENDIR [Input/output error]

[2023-11-10 17:48:46.965415 +0000] E [MSGID: 122038] [ec-dir-read.c:398:ec_manager_readdir] 0-media-disperse-36: EC is not winding readdir: FOP : 'READDIRP' failed on gfid f8ad28d0-05b4-4df3-91ea-73fabf27712c. Parent FOP: No Parent [File descriptor in bad state]

[2023-11-10 17:39:46.076149 +0000] I [MSGID: 109018] [dht-common.c:1840:dht_revalidate_cbk] 0-media-dht: Mismatching layouts for /path/to/folder2, gfid = f04124e5-63e6-4ddf-9b6b-aa47770f90f2 

[2023-11-10 17:39:18.463421 +0000] E [MSGID: 122034] [ec-common.c:662:ec_log_insufficient_vol] 0-media-disperse-4: Insufficient available children for this request: Have : 0, Need : 4 : Child UP : 11111 Mask: 00000, Healing : 00000 : FOP : 'XATTROP' failed on '/path/to/another/folder with gfid f04124e5-63e6-4ddf-9b6b-aa47770f90f2. Parent FOP: SETXATTR 

[2023-11-10 17:36:21.565681 +0000] W [MSGID: 122006] [ec-combine.c:188:ec_iatt_combine] 0-media-disperse-39: Failed to combine iatt (inode: 13324146332441721129-13324146332441721129, links: 2-2, uid: 1000-1000, gid: 1000-1001, rdev: 0-0, size: 10-10, mode: 40775-40775), FOP : 'LOOKUP' failed on '/path/to/yet/another/folder'. Parent FOP: No Parent 

[2023-11-10 17:39:46.147299 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2563:client4_0_lookup_cbk] 0-media-client-1: remote operation failed. [{path=/path/to/folder3}, {gfid=00000000-0000-0000-0000-000000000000}, {errno=13}, {error=Permission denied}] 

[2023-11-10 17:39:46.093069 +0000] W [MSGID: 114061] [client-common.c:1232:client_pre_readdirp_v2] 0-media-client-14: remote_fd is -1. EBADFD [{gfid=f04124e5-63e6-4ddf-9b6b-aa47770f90f2}, {errno=77}, {error=File descriptor in bad state}] 

[2023-11-10 17:55:11.407630 +0000] E [MSGID: 122038] [ec-dir-read.c:398:ec_manager_readdir] 0-media-disperse-30: EC is not winding readdir: FOP : 'READDIRP' failed on gfid 2bba7b7e-7a4b-416a-80f0-dd50caffd2c2. Parent FOP: No Parent [File descriptor in bad state]

[2023-11-10 17:39:46.076179 +0000] W [MSGID: 109221] [dht-selfheal.c:2023:dht_selfheal_directory] 0-media-dht: Directory selfheal failed [{path=/path/to/folder7}, {misc=2}, {unrecoverable-errors}, {gfid=f04124e5-63e6-4ddf-9b6b-aa47770f90f2}] 
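
To see where these messages cluster, the client log can be tallied by message ID and by translator. The following is a rough sketch that assumes every line of interest has the same shape as the excerpts above (timestamp, level, MSGID, source location, translator name, message). Lines that do not match are simply skipped, and the function names are my own.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above. Log lines without a
# [MSGID: ...] field will not match and are ignored by tally().
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[MSGID:\s*(?P<msgid>\d+)\]\s+"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<xlator>\S+?):\s"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def tally(lines):
    """Count messages per (msgid, function) and per translator."""
    by_msgid = Counter()
    by_xlator = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        by_msgid[(rec["msgid"], rec["func"])] += 1
        by_xlator[rec["xlator"]] += 1
    return by_msgid, by_xlator


# Three of the lines quoted above, copied verbatim.
SAMPLE = [
    ("[2023-11-10 17:48:46.965415 +0000] E [MSGID: 122038] "
     "[ec-dir-read.c:398:ec_manager_readdir] 0-media-disperse-36: "
     "EC is not winding readdir: FOP : 'READDIRP' failed on gfid "
     "f8ad28d0-05b4-4df3-91ea-73fabf27712c. Parent FOP: No Parent "
     "[File descriptor in bad state]"),
    ("[2023-11-10 17:55:11.407630 +0000] E [MSGID: 122038] "
     "[ec-dir-read.c:398:ec_manager_readdir] 0-media-disperse-30: "
     "EC is not winding readdir: FOP : 'READDIRP' failed on gfid "
     "2bba7b7e-7a4b-416a-80f0-dd50caffd2c2. Parent FOP: No Parent "
     "[File descriptor in bad state]"),
    ("[2023-11-10 17:39:46.076149 +0000] I [MSGID: 109018] "
     "[dht-common.c:1840:dht_revalidate_cbk] 0-media-dht: Mismatching "
     "layouts for /path/to/folder2, gfid = "
     "f04124e5-63e6-4ddf-9b6b-aa47770f90f2 "),
]

by_msgid, by_xlator = tally(SAMPLE)
for (msgid, func), count in by_msgid.most_common():
    print(f"MSGID {msgid} ({func}): {count}")
for xlator, count in by_xlator.most_common():
    print(f"{xlator}: {count}")
```

Feeding the whole client log through tally() instead of SAMPLE should show whether the errors are concentrated on a few disperse subvolumes or spread across all of them.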

Something about this failed expansion has caused these errors and I'm not sure how to proceed. Right now just about any action, including restarting a node or running a volume set command, causes the client mounts to hang for up to 5 minutes. When I tried increasing a cache timeout value, ~153 of the 200 bricks dropped offline.
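
One detail from the ec_log_insufficient_vol message above may help narrow this down: it reports "Child UP : 11111" together with "Mask: 00000". The sketch below just counts the bits in those fields. It assumes each character stands for one brick of the subvolume and makes no assumption about bit order. My reading, which may be wrong, is that all five bricks of media-disperse-4 were connected but none were considered usable for that operation.

```python
import re

# Field layout copied from the ec_log_insufficient_vol message above.
INSUFFICIENT_RE = re.compile(
    r"Have\s*:\s*(?P<have>\d+),\s*Need\s*:\s*(?P<need>\d+)\s*:\s*"
    r"Child UP\s*:\s*(?P<up>[01]+)\s*"
    r"Mask:\s*(?P<mask>[01]+),\s*"
    r"Healing\s*:\s*(?P<healing>[01]+)"
)


def decode_insufficient(message):
    """Summarise the bit fields of an 'Insufficient available children' message.

    ASSUMPTION: each character in the bit strings corresponds to one brick
    of the disperse subvolume. Only the number of set bits is used, so the
    ordering of bricks within the string does not matter here.
    """
    m = INSUFFICIENT_RE.search(message)
    if m is None:
        return None
    return {
        "have": int(m["have"]),
        "need": int(m["need"]),
        "width": len(m["up"]),
        "up": m["up"].count("1"),
        "usable": m["mask"].count("1"),
        "healing": m["healing"].count("1"),
    }


MESSAGE = (
    "Insufficient available children for this request: Have : 0, Need : 4 : "
    "Child UP : 11111 Mask: 00000, Healing : 00000 : FOP : 'XATTROP' failed"
)

info = decode_insufficient(MESSAGE)
print(
    f"{info['up']} of {info['width']} bricks up, "
    f"{info['usable']} usable, {info['need']} needed"
)
```

If that reading is right, this particular failure would not be a case of bricks being down, which is consistent with the volume status showing them online when the message was logged.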

I've tried:
Running gluster volume heal volumename full, which causes mounts to hang for 3-5 minutes but seems to proceed
Running ls -alhR against the volume to trigger heals
Removing the new bricks, which triggers a rebalance that fails almost immediately and takes most of the self-heal agents offline as well
Turning off bit-rot to reduce load on the system
Replacing a brick with a new brick (same drive, new directory), with force as well
Changing the heal mode from diff to full
Lowering the parallel heal count to 4

When I replaced the one brick, the heal count on that brick dropped from ~100 to ~6; however, those 6 are folders in the root of the volume rather than subfolders many layers in. I suspect this is causing a lot of the issues I'm seeing, and I don't know how to resolve it without damaging any of the existing data.

I'm hoping it's just due to the fix-layout failing and that it just needs to run again, but I wanted to seek guidance from the group so as not to make things worse. I'm not opposed to losing the data already copied to the new bricks; I just need to know how to do so without damaging the data on the original 150 bricks.

I did notice something else odd, which may or may not be pertinent: on one of the original 15 nodes, if I go to the /data/brick1/volume dir and do an ls -l, the ownership shows 1000:1000, which is how it is on the actual FUSE mount as well. If I do the same on one of the new bricks, it shows root:root. I didn't alter any of this, again so as not to cause more problems.
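
The ownership difference may be related to the ec_iatt_combine warning above, where the paired values disagree only on gid (1000-1001). Below is a small sketch of both checks: pulling the disagreeing fields out of that message, and comparing brick-root ownership against what the mount shows. I am assuming each "a-b" pair in the message holds the two values EC failed to reconcile. The node labels in the example are placeholders, and the new brick's path is not given above.

```python
import os
import re

# Message text copied from the ec_iatt_combine warning above.
IATT_MSG = (
    "Failed to combine iatt (inode: 13324146332441721129-13324146332441721129, "
    "links: 2-2, uid: 1000-1000, gid: 1000-1001, rdev: 0-0, size: 10-10, "
    "mode: 40775-40775), FOP : 'LOOKUP' failed on '/path/to/yet/another/folder'."
)


def iatt_mismatches(message):
    """Return {field: (a, b)} for every 'field: a-b' pair where a != b."""
    inner = re.search(r"combine iatt \((.*?)\)", message)
    if inner is None:
        return {}
    pairs = re.findall(r"(\w+):\s*(\d+)-(\d+)", inner.group(1))
    return {name: (a, b) for name, a, b in pairs if a != b}


def owner(path):
    """(uid, gid) of a path, i.e. what ls -l shows numerically with -n."""
    st = os.stat(path)
    return st.st_uid, st.st_gid


def mismatched_owners(owners, expected):
    """Subset of {label: (uid, gid)} that differs from the expected owner."""
    return {label: o for label, o in owners.items() if o != expected}


print(iatt_mismatches(IATT_MSG))   # {'gid': ('1000', '1001')}

# Values as reported in this post; on a real node they would come from
# owner("/data/brick1/volume"). The labels are hypothetical placeholders.
reported = {
    "original node brick root": (1000, 1000),
    "new node brick root": (0, 0),          # root:root
}
print(mismatched_owners(reported, expected=(1000, 1000)))
```

Both checks are read-only, so running owner() against every brick root should be safe while the volume is in this state.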

Thanks in advance for any guidance/help.
-Ed

________

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users