Gluster fuse mount freezes entire server until killed and remounted

Hi all,

We have been observing a critical issue that started within the last several months and has already hit our servers 3 times, seemingly at random.


The symptoms:
  • df stops responding and hangs
  • sometimes apache and nginx stop responding and all requests hang; other times they keep working, though nginx returns incomplete results

When it happened again today, I investigated and narrowed it down to glusterfs, specifically one of the fuse mount processes.

df freezes like this:
stat("/run/user/0", {st_mode=S_IFDIR|0700, st_size=100, …}) = 0
stat("/var/run/user/0", {st_mode=S_IFDIR|0700, st_size=100, …}) = 0
stat("/run/user/1000", {st_mode=S_IFDIR|0700, st_size=80, …}) = 0
stat("/var/run/user/1000", {st_mode=S_IFDIR|0700, st_size=80, …}) = 0
stat("/sys/kernel/debug/tracing", 0x7ffc32784ef0) = -1 EACCES (Permission denied)
stat("/mnt/androidpolicedata3", {st_mode=S_IFDIR|0755, st_size=4096, …}) = 0
stat("/mnt/apkmirror_data1", ^C^C^C^C^C

/mnt/apkmirror_data1 is a glusterfs fuse mount backed by this attached block device:
/dev/disk/by-id/scsi-0LinodeVolumehiveblock1 /mnt/hive_block1 xfs defaults 0 2

It's pretty crazy that any access to this /mnt/apkmirror_data1 location freezes any program issuing the stat call indefinitely.
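
For anyone who wants to detect this condition without freezing their own shell or monitoring script, here is a minimal sketch in Python. It is not part of the original report: the function name `mount_responds` and the 5-second default timeout are illustrative assumptions. The idea is to issue the stat call in a child process and give up if it does not return in time.

```python
import subprocess
import sys


def mount_responds(path, timeout=5.0):
    """Return True if a stat() on `path` completes within `timeout` seconds.

    Hypothetical helper: a stat() that fails quickly (for example with
    ENOENT) still counts as "responding"; only a hang returns False.
    """
    # Run the stat in a separate process so a hung mount blocks the child,
    # not the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child: it may be stuck
        # inside the kernel and never exit.
        child.kill()
        return False
    return True
```

Calling `mount_responds("/mnt/apkmirror_data1")` from a cron job or health check would then return False during a hang like the one described above, instead of blocking forever.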

During this time, the block device itself was reachable and I could list files, so I have to assume the issue lies somewhere in glusterfs, fuse, or the kernel.

After killing this process
root 9485 1 6 Apr30 ? 18:36:26 /usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id=/apkmirror_data1 /mnt/apkmirror_data1
and remounting the fuse mount, everything returned to normal.
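
As a side note for readers with several gluster volumes, the mounts served by such a glusterfs fuse process can be listed without touching the mount points themselves. The sketch below assumes Linux, where gluster fuse mounts appear in /proc/mounts with the filesystem type `fuse.glusterfs`; the function name `gluster_fuse_mounts` is made up for illustration.

```python
def gluster_fuse_mounts(mounts_text):
    """Return (source, mount_point) pairs for fuse.glusterfs entries.

    `mounts_text` is the content of /proc/mounts (or text in that format).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "fuse.glusterfs":
            found.append((fields[0], fields[1]))
    return found
```

Passing it the text read from /proc/mounts yields the candidate mount points. Reading /proc/mounts only parses the kernel's mount table, so it should not block even while one of the listed mounts is hung.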

One suspicion is that the issue started when we upgraded our OpenSUSE 15.1 machines from kernel 5.1.17 to 5.4.10: only machines running 5.4.10 have experienced it, while machines still on 5.1.17 have not. It took 15 days after the last reboot to hit the issue today, so it's very sporadic, but also very critical when it does hit.


Questions:
  1. How can we tell what specific fuse version is being used by gluster?
  2. Are there any gluster or fuse parameters that control the fuse timeout, so that perhaps it internally tries to remount if fuse hangs?
    Currently, it's mounted like this:
    localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0
  3. Does the team have any further thoughts or perhaps someone knows how to fix the issue or has seen a kernel or fuse/gluster advisory?
Thank you.

Sincerely,
Artem

--
Founder, Android Police, APK Mirror, Illogical Robot LLC