Kernel 4.4 is not suitable for a multi-MDS setup. In general, I
wouldn't feel comfortable running kernel cephfs on 4.4 in production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi-MDS setups.

If you can't reboot: maybe try ceph-fuse instead, which works well and
is usually fast enough.

Paul

On Tue, Oct 2, 2018 at 10:45 Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
>
> Hi Paul,
>
> we're using the 4.4 kernel. Not sure if more recent kernels are stable
> for production services. In any case, as there are some production
> services running on those servers, rebooting wouldn't be an option
> if we can bring the ceph clients back without rebooting.
>
> Thanks
>
> Jaime
>
>
> On 01/10/18 21:10, Paul Emmerich wrote:
> > Which kernel version are you using for the kernel cephfs clients?
> > I've seen this problem with "older" kernels (where old is as recent as 4.9)
> >
> > Paul
> >
> > On Mon, Oct 1, 2018 at 18:35 Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
> >> Hi all,
> >>
> >> we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we
> >> enabled multi mds, and after a few hours these errors started
> >> showing up:
> >>
> >> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> >> old, received at 2018-09-28 09:40:16.155841:
> >> client_request(client.31059144:8544450 getattr Xs #0x100002e1e73
> >> 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> >> currently failed to authpin local pins
> >>
> >> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> >> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> >> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> >> below; oldest blocked for > 4614.580689 secs
> >> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> >> old, received at 2018-09-28 10:53:03.203476:
> >> client_request(client.31059144:9080057 lookup #0x100000b7564/58
> >> 2018-09-28 10:53:03.197922 caller_uid=0,
> >> caller_gid=0{}) currently initiated
> >>
> >> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> >> failing to respond to capability release; 5 clients failing to respond
> >> to cache pressure; 1 MDSs report slow requests
> >>
> >> Due to this, we decided to go back to a single mds (as it worked
> >> before); however, the clients pointing to mds.1 started hanging,
> >> while the ones pointing to mds.0 worked fine.
> >>
> >> Then we tried to enable multi mds again and the clients pointing to
> >> mds.1 went back online, but the ones pointing to mds.0 stopped
> >> working.
> >>
> >> Today we tried to go back to a single mds again, but this error was
> >> preventing ceph from disabling the second active mds (mds.1):
> >>
> >> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> >> XXXXX: (30108925), after 68213.084174 seconds
> >>
> >> After waiting for 3 hours, we restarted the mds.1 daemon (as it was
> >> stuck in the stopping state forever due to the above error), waited
> >> for it to become active again, unmounted the problematic clients,
> >> waited for the cluster to be healthy and tried to go back to a
> >> single mds again.
> >>
> >> Apparently this worked with some of the clients, so we tried to
> >> enable multi mds again to bring the faulty clients back, but no luck
> >> this time and some of them are hanging and can't access the ceph fs.
> >>
> >> This is what we have in kern.log:
> >>
> >> Oct 1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> >> Oct 1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> >> Oct 1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> >>
> >> Not sure what else we can try to bring the hanging clients back
> >> without rebooting, as they're in production and rebooting is not an
> >> option.
> >>
> >> Does anyone know how we can deal with this, please?
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >> --
> >>
> >> Jaime Ibar
> >> High Performance & Research Computing, IS Services
> >> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> >> http://www.tchpc.tcd.ie/ | jaime@xxxxxxxxxxxx
> >> Tel: +353-1-896-3725
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
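
For reference, the recovery steps discussed in this thread can be
sketched roughly as the commands below. This is a sketch under
assumptions, not a tested procedure: the filesystem name ("cephfs"),
the monitor address and the mount point are placeholders to substitute
for your own; the client ID is taken from the log excerpts above.

```shell
# Go back to a single active MDS (Luminous-era syntax; "cephfs" is an
# assumed filesystem name).
ceph fs set cephfs max_mds 1
ceph mds deactivate 1            # stop rank 1; deprecated in later releases

# If a client session is stuck, evict it by its client ID
# (31059144 appears in the slow-request log lines above).
ceph tell mds.0 client evict id=31059144

# Check that the filesystem is back to one active MDS.
ceph fs status

# Mount via the FUSE client instead of the 4.4 kernel client
# (mon address and mount point are placeholders).
ceph-fuse -m mon1.example.com:6789 /mnt/cephfs
```

These commands only take effect against a live cluster with admin
credentials, so treat them as a starting point rather than a recipe.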