Re: cephfs clients hanging multi mds to single mds

The kernel cephfs client unfortunately has a tendency to get stuck in
unrecoverable states, especially on older kernels, and usually nothing
short of a reboot will fix it.
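
If the mount is merely stuck on a lost MDS session rather than a kernel
bug, evicting the session on the MDS side and remounting is sometimes
enough; a rough sketch (the rank and client id are examples, not yours):

  # list sessions known to the MDS, including client ids and hostnames
  ceph tell mds.0 client ls
  # evict the stuck session, then try remounting on the client
  ceph tell mds.0 client evict id=<client-id>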

Paul
On Tue, 2 Oct 2018 at 14:55, Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
>
> Hi Paul,
>
> I tried ceph-fuse, mounting it on a different mount point, and it worked.
>
> The problem here is that we can't unmount the ceph kernel client, as it
> is in use by some virsh processes. We forced the unmount and mounted
> ceph-fuse, but we got an I/O error; umount -l cleared all the processes,
> but after rebooting the VMs they didn't come back and a server reboot
> was needed.
>
> Not sure how I can restore the mds session or remount cephfs while
> keeping all the processes alive.
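>
> For reference, roughly what we ran (mount point and monitor address
> are placeholders):
>
>   # see which processes (the virsh/qemu ones) are pinning the mount
>   fuser -vm /mnt/cephfs
>   # lazy unmount: detach now, clean up when the last user exits
>   umount -l /mnt/cephfs
>   # remount the same path via FUSE
>   ceph-fuse -m mon1:6789 /mnt/cephfs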
>
> Thanks a lot for your help.
>
> Jaime
>
>
> On 02/10/18 11:02, Paul Emmerich wrote:
> > Kernel 4.4 is not suitable for a multi MDS setup. In general, I
> > wouldn't feel comfortable running 4.4 with kernel cephfs in
> > production.
> > I think at least 4.15 (not sure, but definitely > 4.9) is recommended
> > for multi MDS setups.
> >
> > If you can't reboot: maybe try ceph-fuse instead, which works very
> > well and is usually fast enough; something like the sketch below.
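> >
> > A minimal sketch, assuming monitors and a client keyring are set up
> > in /etc/ceph (mount point and client name are examples):
> >
> >   # mount via FUSE instead of the kernel client
> >   ceph-fuse --id admin /mnt/cephfs-fuse
> >   # or persistently, via /etc/fstab:
> >   # none  /mnt/cephfs-fuse  fuse.ceph  ceph.id=admin,_netdev,defaults  0 0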
> >
> > Paul
> >
> > On Tue, 2 Oct 2018 at 10:45, Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
> >> Hi Paul,
> >>
> >> we're using the 4.4 kernel; I'm not sure whether more recent kernels
> >> are stable enough for production services. In any case, as there are
> >> production services running on those servers, we'd rather avoid a
> >> reboot if we can bring the ceph clients back without one.
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >>
> >> On 01/10/18 21:10, Paul Emmerich wrote:
> >>> Which kernel version are you using for the kernel cephfs clients?
> >>> I've seen this problem with "older" kernels (where "old" is as recent as 4.9).
> >>>
> >>> Paul
> >>> On Mon, 1 Oct 2018 at 18:35, Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
> >>>> Hi all,
> >>>>
> >>>> we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we
> >>>> enabled multi mds, and after a few hours these errors started
> >>>> showing up:
> >>>>
> >>>> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> >>>> old, received at 2018-09-28 09:40:16.155841:
> >>>> client_request(client.31059144:8544450 getattr Xs #0x100002e1e73
> >>>> 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> >>>> currently failed to authpin local pins
> >>>>
> >>>> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> >>>> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> >>>> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> >>>> below; oldest blocked for > 4614.580689 secs
> >>>> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> >>>> old, received at 2018-09-28 10:53:03.203476:
> >>>> client_request(client.31059144:9080057 lookup #0x100000b7564/58
> >>>> 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
> >>>> currently initiated
> >>>> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> >>>> failing to respond to capability release; 5 clients failing to respond
> >>>> to cache pressure; 1 MDSs report slow requests,
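> >>>>
> >>>> For reference, the clients behind these warnings can be listed
> >>>> with something like this (the daemon name is a placeholder):
> >>>>
> >>>>   ceph health detail
> >>>>   # on the MDS host: per-session client ids, hostnames, caps held
> >>>>   ceph daemon mds.<name> session ls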
> >>>>
> >>>> Due to this, we decided to go back to a single mds (as it had worked
> >>>> before); however, the clients pointing to mds.1 started hanging,
> >>>> while the ones pointing to mds.0 worked fine.
> >>>>
> >>>> Then we tried to enable multi mds again and the clients pointing to
> >>>> mds.1 went back online, but the ones pointing to mds.0 stopped
> >>>> working (the commands we used are sketched below).
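> >>>>
> >>>> On 12.2.x the switching looks roughly like this (fs name is a
> >>>> placeholder):
> >>>>
> >>>>   # allow and grow to two active MDS daemons
> >>>>   ceph fs set <fs_name> allow_multimds true
> >>>>   ceph fs set <fs_name> max_mds 2
> >>>>   # shrink back to one; on Luminous the extra rank must then be
> >>>>   # deactivated by hand
> >>>>   ceph fs set <fs_name> max_mds 1
> >>>>   ceph mds deactivate <fs_name>:1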
> >>>>
> >>>> Today we tried to go back to a single mds, however this error was
> >>>> preventing ceph from disabling the second active mds (mds.1):
> >>>>
> >>>> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> >>>> XXXXX: (30108925), after 68213.084174 seconds
> >>>>
> >>>> After waiting for 3 hours, we restarted the mds.1 daemon (as it was
> >>>> stuck in the stopping state forever due to the above error), waited
> >>>> for it to become active again, unmounted the problematic clients,
> >>>> waited for the cluster to be healthy, and tried to go back to a
> >>>> single mds again.
> >>>>
> >>>> Apparently this worked for some of the clients. We tried to enable
> >>>> multi mds again to bring the faulty clients back, however no luck
> >>>> this time: some of them are hanging and can't access the ceph fs.
> >>>>
> >>>> This is what we have in kern.log:
> >>>>
> >>>> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> >>>> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> >>>> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
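> >>>>
> >>>> On the hanging clients, debugfs shows whether requests are stuck
> >>>> waiting on an mds (assuming debugfs is mounted at the usual place):
> >>>>
> >>>>   # each line is an in-flight MDS request: tid, mds rank, op, path
> >>>>   cat /sys/kernel/debug/ceph/*/mdsc
> >>>>   # in-flight OSD requests, for comparison
> >>>>   cat /sys/kernel/debug/ceph/*/osdc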
> >>>>
> >>>> Not sure what else we can try to bring the hanging clients back
> >>>> without rebooting, as they're in production and rebooting is not an
> >>>> option.
> >>>>
> >>>> Does anyone know how we can deal with this, please?
> >>>>
> >>>> Thanks
> >>>>
> >>>> Jaime
> >>>>
> >>>> --
> >>>>
> >>>> Jaime Ibar
> >>>> High Performance & Research Computing, IS Services
> >>>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> >>>> http://www.tchpc.tcd.ie/ | jaime@xxxxxxxxxxxx
> >>>> Tel: +353-1-896-3725
> >>>>
> >>>
> >>
> >
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



