Kernel 4.4 is not suitable for a multi-MDS setup. In general, I
wouldn't feel comfortable running kernel cephfs on 4.4 in production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi-MDS setups.

If you can't reboot: maybe try ceph-fuse instead, which works well and
is usually fast enough.

Paul

On Tue, Oct 2, 2018 at 10:45 Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
>
> Hi Paul,
>
> we're using the 4.4 kernel. Not sure if more recent kernels are stable
> for production services. In any case, as there are some production
> services running on those servers, rebooting wouldn't be an option
> if we can bring the ceph clients back without rebooting.
>
> Thanks
>
> Jaime
>
>
> On 01/10/18 21:10, Paul Emmerich wrote:
> > Which kernel version are you using for the kernel cephfs clients?
> > I've seen this problem with "older" kernels (where old is as recent as 4.9)
> >
> > Paul
> >
> > On Mon, Oct 1, 2018 at 18:35 Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
> >> Hi all,
> >>
> >> we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we
> >> enabled multi mds, and after a few hours these errors started
> >> showing up:
> >>
> >> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> >> old, received at 2018-09-28 09:40:16.155841:
> >> client_request(client.31059144:8544450 getattr Xs #0x100002e1e73
> >> 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> >> currently failed to authpin local pins
> >>
> >> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> >> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> >> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> >> below; oldest blocked for > 4614.580689 secs
> >> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> >> old, received at 2018-09-28 10:53:03.203476:
> >> client_request(client.31059144:9080057 lookup #0x100000b7564/58
> >> 2018-09-28 10:53:03.197922 caller_uid=0,
> >> caller_gid=0{}) currently initiated
> >>
> >> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> >> failing to respond to capability release; 5 clients failing to respond
> >> to cache pressure; 1 MDSs report slow requests
> >>
> >> Due to this, we decided to go back to a single mds (as it worked
> >> before); however, the clients pointing to mds.1 started hanging,
> >> while the ones pointing to mds.0 worked fine.
> >>
> >> Then we tried to enable multi mds again and the clients pointing to
> >> mds.1 went back online, but the ones pointing to mds.0 stopped
> >> working.
> >>
> >> Today we tried to go back to a single mds again, but this error was
> >> preventing ceph from disabling the second active mds (mds.1):
> >>
> >> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> >> XXXXX: (30108925), after 68213.084174 seconds
> >>
> >> After waiting for 3 hours, we restarted the mds.1 daemon (as it was
> >> stuck in the stopping state forever due to the above error), waited
> >> for it to become active again, unmounted the problematic clients,
> >> waited for the cluster to be healthy and tried to go back to a
> >> single mds again.
> >>
> >> Apparently this worked with some of the clients, so we tried to
> >> enable multi mds again to bring the faulty clients back, but no luck
> >> this time and some of them are hanging and can't access the ceph fs.
> >>
> >> This is what we have in kern.log:
> >>
> >> Oct 1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> >> Oct 1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> >> Oct 1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> >>
> >> Not sure what else we can try to bring the hanging clients back
> >> without rebooting, as they're in production and rebooting is not an
> >> option.
> >>
> >> Does anyone know how we can deal with this, please?
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >> --
> >>
> >> Jaime Ibar
> >> High Performance & Research Computing, IS Services
> >> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> >> http://www.tchpc.tcd.ie/ | jaime@xxxxxxxxxxxx
> >> Tel: +353-1-896-3725
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
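
For reference, the recovery steps discussed in this thread can be
sketched roughly as the commands below. This is a sketch under
assumptions, not a tested procedure: the filesystem name ("cephfs"),
the monitor address and the mount point are placeholders to substitute
for your own; the client ID is taken from the log excerpts above.

```shell
# Go back to a single active MDS (Luminous-era syntax; "cephfs" is an
# assumed filesystem name).
ceph fs set cephfs max_mds 1
ceph mds deactivate 1            # stop rank 1; deprecated in later releases

# If a client session is stuck, evict it by its client ID
# (31059144 appears in the slow-request log lines above).
ceph tell mds.0 client evict id=31059144

# Check that the filesystem is back to one active MDS.
ceph fs status

# Mount via the FUSE client instead of the 4.4 kernel client
# (mon address and mount point are placeholders).
ceph-fuse -m mon1.example.com:6789 /mnt/cephfs
```

These commands only take effect against a live cluster with admin
credentials, so treat them as a starting point rather than a recipe.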