Mounting of Gluster volumes in Kubernetes

Travis Truman <travis.truman@xxxxxxxxxxxx> · Wed, 18 Oct 2017 12:52:57 +0000

Hi all,
I'm wondering whether others in the community are using GlusterFS on Google Compute Engine together with Kubernetes via Google Container Engine.

We're running GlusterFS 3.7.6 on Ubuntu Xenial across 3 GCE nodes. We have a single replicated volume of ~800GB that is mounted by our pods running in Kubernetes.

We've observed a pattern of soft lockups on the Kubernetes nodes that mount our Gluster volume. The affected nodes seem to be the ones with the highest rate of reads/writes to the volume.

An example looks like:

[495498.074071] Kernel panic - not syncing: softlockup: hung tasks
[495498.080108] CPU: 0 PID: 10166 Comm: nginx Tainted: G             L  4.4.64+ #1
[495498.087524] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[495498.096947]  0000000000000000 ffff8803ffc03e20 ffffffffa1317394 ffffffffa1713537
[495498.105113]  ffff8803ffc03eb0 ffff8803ffc03ea0 ffffffffa1139bbc 0000000000000008
[495498.113187]  ffff8803ffc03eb0 ffff8803ffc03e48 000000000000009c 0000000000000000
[495498.121488] Call Trace:
[495498.124131]  <IRQ>  [<ffffffffa1317394>] dump_stack+0x63/0x8f
[495498.130207]  [<ffffffffa1139bbc>] panic+0xc6/0x1ec
[495498.135208]  [<ffffffffa10f65a7>] watchdog_timer_fn+0x1e7/0x1f0
[495498.141327]  [<ffffffffa10f63c0>] ? watchdog+0xa0/0xa0
[495498.146668]  [<ffffffffa10b8f1f>] __hrtimer_run_queues+0xff/0x260
[495498.152959]  [<ffffffffa10b93ec>] hrtimer_interrupt+0xac/0x1b0
[495498.158993]  [<ffffffffa15b2918>] smp_apic_timer_interrupt+0x68/0xa0
[495498.167232]  [<ffffffffa15b1222>] apic_timer_interrupt+0x82/0x90
[495498.173432]  <EOI>  [<ffffffffa109a6d0>] ? prepare_to_wait_exclusive+0x80/0x80
[495498.182557]  [<ffffffffc02e331f>] ? 0xffffffffc02e331f
[495498.187893]  [<ffffffffa109a9e0>] ? prepare_to_wait_event+0xf0/0xf0
[495498.194357]  [<ffffffffc02e3679>] 0xffffffffc02e3679
[495498.199519]  [<ffffffffc02e723a>] fuse_simple_request+0x11a/0x1e0 [fuse]
[495498.206415]  [<ffffffffc02e7f71>] fuse_dev_cleanup+0xa81/0x1ef0 [fuse]
[495498.213151]  [<ffffffffa11b91a9>] lookup_fast+0x249/0x330
[495498.218748]  [<ffffffffa11b95bd>] walk_component+0x3d/0x500

While this particular issue seems more related to the FUSE client talking to Gluster, we're wondering whether others have seen this type of behavior, whether there are particular troubleshooting or tuning steps we should take on the Gluster side of the problem, and whether the community has any general tips on using Gluster and Kubernetes together.
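
In case it helps anyone compare against their own nodes, here is a minimal Python sketch for checking whether a captured kernel log contains a soft-lockup panic with frames from the fuse module in its call trace. It assumes the trace format shown in the example above (4.4-era "[<address>] symbol+0xoff/0xlen [module]" lines). The names summarize_trace and fuse_frames are hypothetical helpers made up for illustration, not part of any Gluster, Kubernetes, or kernel tooling.

```python
import re

# Assumed format, taken from the trace above, e.g.:
#   [495498.199519]  [<ffffffffc02e723a>] fuse_simple_request+0x11a/0x1e0 [fuse]
# Unresolved frames such as "? 0xffffffffc02e331f" are deliberately not matched.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)
PANIC_RE = re.compile(r"Kernel panic - not syncing: softlockup")


def summarize_trace(lines):
    """Return (saw_softlockup_panic, frames) for an iterable of log lines.

    frames is a list of (symbol, module) tuples; module is None when the
    frame carries no "[module]" suffix (i.e. it belongs to the core kernel).
    """
    saw_panic = False
    frames = []
    for line in lines:
        if PANIC_RE.search(line):
            saw_panic = True
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return saw_panic, frames


def fuse_frames(frames):
    """Return the symbols of frames attributed to the fuse module."""
    return [symbol for symbol, module in frames if module == "fuse"]
```

Feeding it the lines of a saved console log (for example, a serial console capture written to a file) gives a quick yes/no on whether a given panic matches the pattern above. It only pattern-matches text and says nothing about the root cause.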

Thanks in advance,
Travis Truman
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users