Hi all,
We have been observing a critical issue that started in the last several months and has already randomly affected our servers 3 times.
The symptoms:
- df stops responding and hangs
- sometimes Apache and nginx stop responding and all requests hang; at other times they keep working, though nginx returns incomplete results
Upon investigating the issue when it happened again today, I narrowed it down to glusterfs and specifically one of the fuse mount processes.
df freezes like this:
stat("/run/user/0", {st_mode=S_IFDIR|0700, st_size=100, …}) = 0
stat("/var/run/user/0", {st_mode=S_IFDIR|0700, st_size=100, …}) = 0
stat("/run/user/1000", {st_mode=S_IFDIR|0700, st_size=80, …}) = 0
stat("/var/run/user/1000", {st_mode=S_IFDIR|0700, st_size=80, …}) = 0
stat("/sys/kernel/debug/tracing", 0x7ffc32784ef0) = -1 EACCES (Permission denied)
stat("/mnt/androidpolicedata3", {st_mode=S_IFDIR|0755, st_size=4096, …}) = 0
stat("/mnt/apkmirror_data1", ^C^C^C^C^C
/mnt/apkmirror_data1 is a fuse mount by glusterfs corresponding to this attached block device:
/dev/disk/by-id/scsi-0LinodeVolumehiveblock1 /mnt/hive_block1 xfs defaults 0 2
It's pretty crazy that any access to this /mnt/apkmirror_data1 location freezes any program issuing the stat call indefinitely.
During this time, the block device itself was reachable and I could list files, so I have to assume the issue lies somewhere in glusterfs, fuse, or the kernel.
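For anyone who wants to check for this condition without hanging their own shell, here is a minimal sketch of a probe. It runs stat in a child process and gives up after a deadline instead of blocking forever. The function name, the 5-second default, and the probe_cmd override are my own assumptions for illustration; none of this is something gluster or fuse provides.

```python
import subprocess
import time


def mount_responds(path, timeout=5.0, probe_cmd=None):
    """Return True if a stat of `path` finishes within `timeout` seconds.

    The probe runs in a child process, so a hung FUSE mount blocks the child
    and not the caller. `probe_cmd` is a hypothetical override for testing.
    """
    cmd = probe_cmd if probe_cmd is not None else ["stat", "--", path]
    child = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if child.poll() is not None:
            # stat returned (success or error), so the mount answered.
            return True
        time.sleep(0.05)
    # Deadline passed. Try to kill the child, but do not wait for it:
    # a process stuck in the kernel may never exit.
    child.kill()
    return False
```

Calling mount_responds("/mnt/apkmirror_data1") from cron or a monitoring agent would return False during a hang like the one above, rather than freezing the way df does.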
After killing this process
root 9485 1 6 Apr30 ? 18:36:26 /usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id=/apkmirror_data1 /mnt/apkmirror_data1
and remounting the fuse mount, everything returned to normal.
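If this recovery were scripted, it could look roughly like the sketch below. The steps above were done by hand, so the details here are assumptions: that SIGKILL is the signal used, that a lazy unmount (umount -l) is needed to clear the stale mount point, and that mount can find the entry in fstab. The helper names are hypothetical, and it only prints the commands unless dry_run is turned off.

```python
import subprocess


def recovery_commands(pid, mountpoint):
    """Build the commands for a kill-and-remount recovery.

    Assumptions (not from the original report): SIGKILL is used, a lazy
    unmount clears the stale mount, and `mount` resolves the fstab entry.
    """
    return [
        ["kill", "-KILL", str(pid)],   # stop the hung glusterfs fuse client
        ["umount", "-l", mountpoint],  # lazily detach the stale mount point
        ["mount", mountpoint],         # remount using the fstab entry
    ]


def recover(pid, mountpoint, dry_run=True):
    """Print the recovery commands; run them only when dry_run is False."""
    for cmd in recovery_commands(pid, mountpoint):
        print(" ".join(cmd))
        if not dry_run:
            # check=False: umount may fail if the mount already went away
            # when the client process died.
            subprocess.run(cmd, check=False)
```

For the process above, recover(9485, "/mnt/apkmirror_data1") would print the three commands without running anything.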
One of my suspicions is that the issue started when we upgraded our openSUSE 15.1 machines from kernel 5.1.17 to 5.4.10. Machines still on 5.1.17 haven't experienced it; only machines running 5.4.10 have. Today's hang came 15 days after the last reboot, so it's very sporadic, but also very critical when it does hit.
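To sort hosts into those two groups, a small check along these lines could be run on each machine. It is only a sketch: the two version constants come from the observation above, while treating everything at or below 5.1.17 as "not seen hanging" and everything at or above 5.4.10 as "seen hanging" is an assumption, since we have only observed those two exact versions.

```python
import platform
import re

# Versions from the report: 5.1.17 has not shown the hang, 5.4.10 has.
LAST_UNAFFECTED = (5, 1, 17)
FIRST_AFFECTED = (5, 4, 10)


def kernel_version(release=None):
    """Parse a release string such as '5.4.10-1-default' into (5, 4, 10)."""
    if release is None:
        release = platform.release()
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def classify(release=None):
    """Label a kernel relative to the two versions observed so far."""
    version = kernel_version(release)
    if version <= LAST_UNAFFECTED:
        return "not seen hanging"  # assumption: older kernels behave like 5.1.17
    if version >= FIRST_AFFECTED:
        return "seen hanging"      # assumption: newer kernels behave like 5.4.10
    return "unknown"               # between the two observed versions
```

Calling classify() with no argument uses the running kernel's release string.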
Questions:
- How can we tell what specific fuse version is being used by gluster?
- Are there any gluster or fuse parameters that control the fuse timeout, so that perhaps it internally tries to remount if fuse hangs?
Currently, it's mounted like this:
localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0
- Does the team have any further thoughts? Does anyone know how to fix the issue, or has anyone seen a kernel or fuse/gluster advisory?
Thank you.