Hi gluster users,

I'm hoping to get some help with an issue on a dispersed volume (EC: 2x(4+2)) that's been causing me some headaches. This is on a cluster running Gluster 6.9 on CentOS 7.

At some point in the last week, writes to one of my bricks started failing with a "No space left on device" error:

[2021-07-06 16:08:57.261307] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-gluster-01-server: 1853436561: WRITEV -2 (f2d6f2f8-4fd7-4692-bd60-23124897be54), client: CTX_ID:648a7383-46c8-4ed7-a921-acafc90bec1a-GRAPH_ID:4-PID:19471-HOST:rhevh08.mgmt.triumf.ca-PC_NAME:gluster-01-client-5-RECON_NO:-5, error-xlator: gluster-01-posix [No space left on device]

The disk is quite full (listed as 100% on the server), but it does have some writable room left:

/dev/mapper/vg--brick1-brick1   11T   11T   97G  100%  /data/glusterfs/gluster-01/brick1

However, I'm not sure whether the amount of disk space used on the physical drive is the true cause of the "No space left on device" errors anyway. I can still manually write to this brick outside of Gluster, so it seems the operating system isn't preventing the writes from happening. (A couple of checks I still plan to run are sketched at the end of this mail.)

During my investigation, I noticed that one of the .glusterfs paths on the problem server is using up much more space than it does on the other servers. I can't quite figure out why that might be, or how it happened, and I'd appreciate any advice on what the cause might have been. I had done some package updates on the server with the issue, and not on the other servers. These included a new kernel version, but not the Gluster packages; possibly the updates, or the reboot to load the new kernel, caused a problem. I have scripts on my Gluster machines that nicely kill all of the brick processes before rebooting, so I'm not leaning towards an abrupt shutdown as the cause, but it's a possibility.

I'm also looking for advice on how to safely remove the problem file and rebuild it from the other Gluster peers. I've seen some documentation on this, but I'm a little nervous about corrupting the volume if I misunderstand the process (my current understanding is sketched at the end of this mail). I'm not free to take the volume or cluster down for maintenance at this point, but that might be something I'll have to consider if it's my only option.

For reference, here's a comparison across the six hosts of the same path, which seems to be taking up extra space on only one of them:

1: 26G   /data/gluster-01/brick1/vol/.glusterfs/99/56
2: 26G   /data/gluster-01/brick1/vol/.glusterfs/99/56
3: 26G   /data/gluster-01/brick1/vol/.glusterfs/99/56
4: 26G   /data/gluster-01/brick1/vol/.glusterfs/99/56
5: 26G   /data/gluster-01/brick1/vol/.glusterfs/99/56
6: 3.0T  /data/gluster-01/brick1/vol/.glusterfs/99/56
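First, the checks I mentioned for the ENOSPC errors. If I understand correctly, Gluster itself reserves a slice of each brick (the storage.reserve volume option, 1% by default) and starts returning "No space left on device" once free space drops below that, and 97G free on an 11T brick is just under 1%. Inode exhaustion is the other thing I want to rule out. Something like this should cover both (volume name taken from the log line above, brick path as on my server):

  # Is Gluster's own reserve kicking in? (storage.reserve, 1% by default;
  # 97G free on 11T is roughly 0.9%)
  gluster volume get gluster-01 storage.reserve

  # Rule out inode exhaustion, which also surfaces as ENOSPC
  df -i /data/glusterfs/gluster-01/brick1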
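For the oversized .glusterfs directory, my plan is to find the largest GFID files under the suspect prefix and map them back to their real paths. As I understand it, for a regular file the .glusterfs entry is a hardlink to the data file, so the path can be recovered by inode. Roughly (the <gfid> filename is a placeholder):

  # Largest entries under the suspect prefix
  du -ah /data/gluster-01/brick1/vol/.glusterfs/99/56 | sort -h | tail -20

  # Map one of them back to its user-visible path via the hardlink
  find /data/gluster-01/brick1/vol -samefile \
      /data/gluster-01/brick1/vol/.glusterfs/99/56/<gfid>

If the big entries turn out to have a link count of 1, I assume that means the user-visible files were already deleted, which would be its own clue.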
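And for reference, here's my current understanding of the remove-and-rebuild procedure from the docs I've found; I'd be grateful if someone could confirm it before I try it (volume name from the logs, file path and GFID are placeholders):

  # On the problem brick ONLY: remove the data file and its GFID hardlink
  rm /data/gluster-01/brick1/vol/<path/to/file>
  rm /data/gluster-01/brick1/vol/.glusterfs/99/56/<gfid>

  # Then trigger a heal and watch it rebuild from the other bricks
  gluster volume heal gluster-01 full
  gluster volume heal gluster-01 info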
Any and all advice is appreciated. Thanks!

--
Daniel Thomson
DevOps Engineer
t +1 604 222 7428
dthomson@xxxxxxxxx
TRIUMF Canada's particle accelerator centre
www.triumf.ca @TRIUMFLab
4004 Wesbrook Mall
Vancouver BC V6T 2A3 Canada