Re: kvm vm cephfs mount hangs on osd node (something like umount -l available?) (help wanted going to production)


Just got this during the bonnie++ test, while trying to do an ls -l on the 
cephfs mount. I also have a kworker process constantly at 40% CPU while the 
bonnie++ test is running.

[35281.101763] INFO: task bash:1169 blocked for more than 120 seconds.
[35281.102064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35281.102175] bash            D ffffa03fbfc9acc0     0  1169   1167 0x00000004
[35281.102181] Call Trace:
[35281.102275]  [<ffffffff84b86d4f>] ? __schedule+0x3af/0x860
[35281.102285]  [<ffffffff84b87229>] schedule+0x29/0x70
[35281.102296]  [<ffffffff84b84d11>] schedule_timeout+0x221/0x2d0
[35281.102332]  [<ffffffff844c6966>] ? finish_wait+0x56/0x70
[35281.102342]  [<ffffffff84b85482>] ? mutex_lock+0x12/0x2f
[35281.102381]  [<ffffffff846e7ed8>] ? autofs4_wait+0x428/0x920
[35281.102386]  [<ffffffff84b875dd>] wait_for_completion+0xfd/0x140
[35281.102407]  [<ffffffff844daf40>] ? wake_up_state+0x20/0x20
[35281.102422]  [<ffffffff846e902b>] autofs4_expire_wait+0xab/0x160
[35281.102425]  [<ffffffff846e6060>] do_expire_wait+0x1e0/0x210
[35281.102429]  [<ffffffff846e62b3>] autofs4_d_manage+0x73/0x1c0
[35281.102455]  [<ffffffff84658e8a>] follow_managed+0xba/0x310
[35281.102459]  [<ffffffff84659e5d>] lookup_fast+0x12d/0x230
[35281.102464]  [<ffffffff8465c90d>] path_lookupat+0x16d/0x8d0
[35281.102467]  [<ffffffff8465deed>] ? do_last+0x66d/0x1340
[35281.102488]  [<ffffffff8464a73a>] ? __check_object_size+0x1ca/0x250
[35281.102499]  [<ffffffff84628675>] ? kmem_cache_alloc+0x35/0x1f0
[35281.102503]  [<ffffffff8465fc0f>] ? getname_flags+0x4f/0x1a0
[35281.102507]  [<ffffffff8465d09b>] filename_lookup+0x2b/0xc0
[35281.102510]  [<ffffffff84660da7>] user_path_at_empty+0x67/0xc0
[35281.102513]  [<ffffffff84660e11>] user_path_at+0x11/0x20
[35281.102516]  [<ffffffff84653603>] vfs_fstatat+0x63/0xc0
[35281.102519]  [<ffffffff846539be>] SYSC_newstat+0x2e/0x60
[35281.102529]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
[35281.102533]  [<ffffffff84b94ec9>] ? system_call_after_swapgs+0x96/0x13a
[35281.102536]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
[35281.102539]  [<ffffffff84b94ec9>] ? system_call_after_swapgs+0x96/0x13a
[35281.102543]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
[35281.102546]  [<ffffffff84b94ec9>] ? system_call_after_swapgs+0x96/0x13a
[35281.102549]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
[35281.102552]  [<ffffffff84b94ec9>] ? system_call_after_swapgs+0x96/0x13a
[35281.102555]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
[35281.102558]  [<ffffffff84b94ec9>] ? system_call_after_swapgs+0x96/0x13a
[35281.102561]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
[35281.102565]  [<ffffffff84653e7e>] SyS_newstat+0xe/0x10
[35281.102568]  [<ffffffff84b94f92>] system_call_fastpath+0x25/0x2a
[35281.102572]  [<ffffffff84b94ed5>] ? system_call_after_swapgs+0xa2/0x13a
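
(For anyone hitting the same thing: these are the read-only checks I am using 
to see where the requests are actually stuck. The mds name is a placeholder 
for my own daemon, and debugfs has to be mounted on the client.)

# On the client VM: MDS/OSD requests the kernel client is still waiting on
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc

# On the MDS host: operations in flight and current client sessions
ceph daemon mds.<name> dump_ops_in_flight
ceph daemon mds.<name> session ls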
 

-----Original Message-----
To: ceph-users
Subject: kvm vm cephfs mount hangs on osd node (something like umount -l available?) (help wanted going to production)



I have a VM on an OSD node (it can reach the host and the other nodes via the 
macvtap interface used by both host and guest). I just did a simple bonnie++ 
test and everything seems to be fine. Yesterday, however, the dovecot process 
apparently caused problems (cephfs is only used for an archive namespace; the 
inbox is on RBD SSD, and the fs metadata is also on SSD).

How can I recover from such a lock-up? If I have a similar situation with an 
nfs-ganesha mount, I have the option to do a umount -l, and clients recover 
quickly without any issues.

Having to reset the VM is not really an option. What is the best way to 
resolve this?
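
What I understand the options to be, going by the eviction doc linked below 
(the session id, mount point and mount options here are only examples, not 
taken from my cluster):

# on a monitor/admin node: find and evict the stuck client session
ceph tell mds.0 client ls
ceph tell mds.0 client evict id=4305

# eviction also blacklists the client; check / clean up afterwards
ceph osd blacklist ls
ceph osd blacklist rm <addr:port/nonce>

# on the vm: once evicted the mount returns errors and can be forced off
umount -f /mnt/archive
mount -t ceph mon01:6789:/ /mnt/archive -o name=archive,secretfile=/etc/ceph/archive.secret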



Ceph cluster: 14.2.11 (the VM has 14.2.16)

There is nothing special in my ceph.conf, just these two settings in the [mds] section:

mds bal fragment size max = 120000
# maybe for nfs-ganesha problems?
# http://docs.ceph.com/docs/master/cephfs/eviction/
#mds_session_blacklist_on_timeout = false
#mds_session_blacklist_on_evict = false
mds_cache_memory_limit = 17179860387
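
For completeness, this is how I check what the running MDS actually uses, and 
how the same setting could be applied via the monitor config db instead (the 
mds name and the 16 GiB value are only examples):

# ask the running daemon
ceph daemon mds.<name> config get mds_cache_memory_limit

# or set it centrally (Nautilus config database)
ceph config set mds mds_cache_memory_limit 17179869184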


All running:
CentOS Linux release 7.9.2009 (Core)
Linux mail04 3.10.0-1160.6.1.el7.x86_64 #1 SMP Tue Nov 17 13:59:11 UTC 
2020 x86_64 x86_64 x86_64 GNU/Linux
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



