cephfs clients hanging multi mds to single mds

Jaime Ibar <jaime@xxxxxxxxxxxx> · Mon, 1 Oct 2018 17:34:44 +0100

Hi all,

We're running a Ceph 12.2.7 (Luminous) cluster. Two weeks ago we enabled 
multi-MDS, and after a few hours these errors started showing up:

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
100002e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) 
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds 
old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to respond 
to cache pressure; 1 MDSs report slow requests,
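
To see which clients the slow requests belong to, the warnings can be 
summarised per client. This is a minimal sketch, assuming the warnings 
keep the layout of the lines quoted above; the regular expression and 
the function names are made up for illustration and are not part of 
Ceph.

```python
import re
from collections import Counter

# Assumed layout of an MDS "slow request" warning, taken only from the
# log lines quoted in this post; other Ceph versions may differ.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, received at .*?"
    r"client_request\(client\.(?P<client>\d+):(?P<tid>\d+) (?P<op>\w+)"
)


def slow_requests(log_text):
    """Return (client_id, operation, age_in_seconds) for each warning."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = re.sub(r"\s+", " ", log_text)
    return [
        (int(m["client"]), m["op"], float(m["age"]))
        for m in SLOW_REQUEST.finditer(flat)
    ]


def requests_per_client(log_text):
    """Count slow-request warnings per client id."""
    return Counter(client for client, _op, _age in slow_requests(log_text))
```

Run over the two slow-request warnings quoted above, this attributes 
both of them (a getattr and a lookup) to client.31059144.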

Because of this, we decided to go back to a single MDS (as it worked 
before). However, the clients pointing to mds.1 started hanging, while 
the ones pointing to mds.0 worked fine.

Then we tried to enable multi-MDS again, and the clients pointing to 
mds.1 came back online, but the ones pointing to mds.0 stopped working.

Today we tried to go back to a single MDS again, but this error was 
preventing Ceph from deactivating the second active MDS (mds.1):

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
XXXXX: (30108925), after 68213.084174 seconds

After waiting for 3 hours, we restarted the mds.1 daemon (it was stuck 
in the stopping state indefinitely due to the above error), waited for 
it to become active again, unmounted the problematic clients, waited for 
the cluster to become healthy, and tried to go back to a single MDS 
again.

Apparently this worked for some of the clients. We then tried to enable 
multi-MDS again to bring the faulty clients back, but had no luck this 
time: some of them are still hanging and can't access CephFS.

This is what we have in kern.log on one of the clients:

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

We're not sure what else we can try to bring the hanging clients back 
without rebooting them. They're in production, so rebooting is not an 
option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | jaime@xxxxxxxxxxxx
Tel: +353-1-896-3725

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com