Re: CephFS error: currently failed to rdlock, waiting. clients crashing and evicted


 



Thomas,

This is controlled by the MDS option mds_cap_revoke_eviction_timeout (300s
by default). If a client has crashed or hung for a long time, the cluster
will evict it.

This prevents other clients from hanging while they wait for locks. If you're
sure the client will recover later, you can set it to zero.
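
A minimal sketch of changing that option, assuming the `ceph` CLI is on the PATH with admin credentials and that the centralized `ceph config set` interface is available on your release. The helper names `build_cmd` and `set_eviction_timeout` are hypothetical and only for illustration; by default the function only returns the command string instead of running it.

```python
import subprocess

# Option named in the reply above; everything else here is illustrative.
OPTION = "mds_cap_revoke_eviction_timeout"


def build_cmd(timeout_seconds):
    """Build the argv that sets the option for all MDS daemons."""
    if timeout_seconds < 0:
        raise ValueError("timeout must be >= 0")
    return ["ceph", "config", "set", "mds", OPTION, str(timeout_seconds)]


def set_eviction_timeout(timeout_seconds, dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    cmd = build_cmd(timeout_seconds)
    if not dry_run:
        # Raises CalledProcessError if the ceph CLI reports a failure.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Dry run: print the command that would be executed.
    print(set_eviction_timeout(0))
```

Keeping the dry run as the default makes it easy to review the exact command before touching a production cluster.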

Hope this helps.

Yours, Norman

On 18/11/2020 6:49 AM, Thomas Hukkelberg wrote:
Hi all!

Hopefully some of you can shed some light on this. We have big problems with Samba crashing when macOS SMB clients access certain/random folders/files over vfs_ceph.

When browsing the CephFS folder in question directly on a Ceph node where CephFS is mounted, we experience issues such as slow directory listings. We suspect that macOS fetching xattr metadata creates a lot of traffic, but it should not lock up the cluster like this. In the logs we see both rdlock and wrlock, but mostly rdlocks.

End clients experience spurious disconnects when the issue occurs, roughly up to a handful of times a day. Is this a config issue? Have we hit a bug? It's certainly not a feature :/

Any pointers on how to troubleshoot or rectify this problem are most welcome.

ceph version 14.2.11
samba version 4.12.10-SerNet-Ubuntu-10.focal
Supermicro X11, Intel Silver 4110, 9 Ceph nodes, 2x40GbE network, 150 OSD spinners, NVMe db/journal

--

2020-11-17 22:09:07.525706 [WRN] evicting unresponsive client bo-samba-03 (3887652779), after 301.746 seconds
2020-11-17 22:09:07.525580 [INF] Evicting (and blacklisting) client session 3877970532 (10.40.30.133:0/3971626932)
2020-11-17 22:09:07.525536 [WRN] evicting unresponsive client bo-samba-03 (3877970532), after 302.034 seconds
2020-11-17 22:07:23.915412 [INF] Cluster is now healthy
2020-11-17 22:07:23.915381 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2020-11-17 22:07:23.915330 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
2020-11-17 22:07:23.064492 [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
2020-11-17 22:07:23.064457 [INF] MDS health message cleared (mds.?): Client bo-samba-03 failing to respond to capability release
2020-11-17 22:07:17.524023 [WRN] client.3887663354 isn't responding to mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, sent 63.325997 seconds ago
2020-11-17 22:07:17.523987 [INF] Evicting (and blacklisting) client session 3887663354 (10.40.30.133:0/3230547239)
2020-11-17 22:07:17.523967 [WRN] evicting unresponsive client bo-samba-03 (3887663354), after 64.5412 seconds
2020-11-17 22:07:17.523610 [WRN] slow request 63.325528 seconds old, received at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup #0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=111139, caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:07:17.523596 [WRN] 1 slow requests, 1 included below; oldest blocked for > 63.325529 secs
2020-11-17 22:07:19.255177 [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:07:12.523453 [WRN] 1 slow requests, 0 included below; oldest blocked for > 58.325433 secs
2020-11-17 22:07:07.523382 [WRN] 1 slow requests, 0 included below; oldest blocked for > 53.325362 secs
2020-11-17 22:07:02.523360 [WRN] 1 slow requests, 0 included below; oldest blocked for > 48.325307 secs
2020-11-17 22:06:57.523218 [WRN] 1 slow requests, 0 included below; oldest blocked for > 43.325199 secs
2020-11-17 22:06:52.523203 [WRN] 1 slow requests, 0 included below; oldest blocked for > 38.325158 secs
2020-11-17 22:06:47.523105 [WRN] slow request 33.325065 seconds old, received at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup #0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=111139, caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:06:47.523100 [WRN] 1 slow requests, 1 included below; oldest blocked for > 33.325065 secs
2020-11-17 22:06:51.431745 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2020-11-17 22:06:20.045030 [INF] Cluster is now healthy
2020-11-17 22:06:20.045008 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2020-11-17 22:06:20.044960 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
2020-11-17 22:06:19.062307 [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
2020-11-17 22:06:19.062253 [INF] MDS health message cleared (mds.?): Client bo-samba-03 failing to respond to capability release
2020-11-17 22:06:15.936150 [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:06:12.522624 [WRN] client.3869410498 isn't responding to mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, sent 64.045677 seconds ago
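
For anyone triaging similar output, here is a small sketch that pulls the eviction events out of a cluster log excerpt like the one above. It assumes the line layout matches these log lines exactly; the pattern and the function name `parse_evictions` are illustrative, not part of any Ceph tooling.

```python
import re

# Matches lines such as:
# 2020-11-17 22:09:07.525706 [WRN] evicting unresponsive client bo-samba-03 (3887652779), after 301.746 seconds
EVICT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[WRN\] "
    r"evicting unresponsive client (?P<host>\S+) \((?P<session>\d+)\), "
    r"after (?P<secs>[\d.]+) seconds"
)


def parse_evictions(lines):
    """Return (timestamp, host, session_id, seconds) for each eviction line."""
    events = []
    for line in lines:
        m = EVICT_RE.match(line.strip())
        if m:
            events.append(
                (m["ts"], m["host"], int(m["session"]), float(m["secs"]))
            )
    return events


if __name__ == "__main__":
    import sys

    # Usage: python parse_evictions.py < cluster.log
    for ts, host, session, secs in parse_evictions(sys.stdin):
        print(f"{ts} {host} session={session} evicted after {secs:.1f}s")
```

Run against the excerpt above, this separates the evictions that happen after roughly 64 seconds from those that happen after roughly 302 seconds, which helps when checking which timeout is actually firing.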


--thomas

--
Thomas Hukkelberg
thomas@xxxxxxxxxxxxxxxxx
+47 971 81 192
--
support@xxxxxxxxxxxxxxxxx
+47 966 44 999




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



