The kernel cephfs client unfortunately has a tendency to get stuck in
unrecoverable states, especially on older kernels. Usually it's not
recoverable without a reboot.

Paul

On Tue, Oct 2, 2018 at 14:55, Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
>
> Hi Paul,
>
> I tried ceph-fuse, mounting it on a different mount point, and it worked.
> The problem here is that we can't unmount the ceph kernel client, as it
> is in use by some virsh processes. We forced the unmount and mounted
> ceph-fuse, but we got an I/O error; umount -l cleared all the processes,
> but after rebooting the VMs they didn't come back, and a server reboot
> was needed.
> Not sure how I can restore the MDS session or remount cephfs while
> keeping all processes alive.
>
> Thanks a lot for your help.
>
> Jaime
>
> On 02/10/18 11:02, Paul Emmerich wrote:
> > Kernel 4.4 is not suitable for a multi-MDS setup. In general, I
> > wouldn't feel comfortable running 4.4 with kernel cephfs in
> > production. I think at least 4.15 (not sure, but definitely > 4.9) is
> > recommended for multi-MDS setups.
> >
> > If you can't reboot: maybe try ceph-fuse instead, which is usually
> > very awesome and usually fast enough.
> >
> > Paul
> >
> > On Tue, Oct 2, 2018 at 10:45, Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
> >> Hi Paul,
> >>
> >> we're using the 4.4 kernel. Not sure whether more recent kernels are
> >> stable enough for production services. In any case, as there are some
> >> production services running on those servers, rebooting wouldn't be
> >> an option if we can bring the ceph clients back without rebooting.
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >> On 01/10/18 21:10, Paul Emmerich wrote:
> >>> Which kernel version are you using for the kernel cephfs clients?
> >>> I've seen this problem with "older" kernels (where old is as recent
> >>> as 4.9).
> >>>
> >>> Paul
> >>>
> >>> On Mon, Oct 1, 2018 at 18:35, Jaime Ibar <jaime@xxxxxxxxxxxx> wrote:
> >>>> Hi all,
> >>>>
> >>>> we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we
> >>>> enabled multi-MDS, and after a few hours these errors started
> >>>> showing up:
> >>>>
> >>>> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475
> >>>> seconds old, received at 2018-09-28 09:40:16.155841:
> >>>> client_request(client.31059144:8544450 getattr Xs #0x100002e1e73
> >>>> 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> >>>> currently failed to authpin local pins
> >>>>
> >>>> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5
> >>>> clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
> >>>> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> >>>> below; oldest blocked for > 4614.580689 secs
> >>>> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854
> >>>> seconds old, received at 2018-09-28 10:53:03.203476:
> >>>> client_request(client.31059144:9080057 lookup #0x100000b7564/58
> >>>> 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
> >>>> currently initiated
> >>>> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1
> >>>> clients failing to respond to capability release; 5 clients failing
> >>>> to respond to cache pressure; 1 MDSs report slow requests
> >>>>
> >>>> Due to this, we decided to go back to a single MDS (as it had
> >>>> worked before); however, the clients pointing to mds.1 started
> >>>> hanging, while the ones pointing to mds.0 worked fine.
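> >>>>
> >>>> (For reference, a minimal sketch of that rollback on Luminous,
> >>>> assuming the filesystem is named "cephfs" -- on 12.2.x, rank 1
> >>>> still has to be deactivated by hand after lowering max_mds:)
> >>>>
> >>>>     # cap the filesystem at one active MDS
> >>>>     ceph fs set cephfs max_mds 1
> >>>>     # ask rank 1 to stop; it hands its subtrees back to rank 0
> >>>>     ceph mds deactivate cephfs:1
> >>>>     # watch until rank 1 has left the "stopping" state
> >>>>     ceph status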
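> >>>>
> >>>> (And a sketch of how to check which clients hold sessions on which
> >>>> rank, and what is actually stuck -- mds.<name> is a placeholder for
> >>>> the daemon name on its host:)
> >>>>
> >>>>     # list client sessions on rank 1, via the monitors
> >>>>     ceph tell mds.1 client ls
> >>>>     # or through the admin socket on the MDS host itself
> >>>>     ceph daemon mds.<name> session ls
> >>>>     # dump the requests behind the "slow request" warnings above
> >>>>     ceph daemon mds.<name> dump_ops_in_flight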
> >>>> Then we tried to enable multi-MDS again and the clients pointing
> >>>> to mds.1 went back online; however, the ones pointing to mds.0
> >>>> stopped working.
> >>>>
> >>>> Today we tried to go back to a single MDS again, but this error was
> >>>> preventing ceph from disabling the second active MDS (mds.1):
> >>>>
> >>>> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> >>>> XXXXX: (30108925), after 68213.084174 seconds
> >>>>
> >>>> After waiting for 3 hours, we restarted the mds.1 daemon (as it was
> >>>> stuck in the stopping state forever due to the above error), waited
> >>>> for it to become active again, unmounted the problematic clients,
> >>>> waited for the cluster to be healthy, and tried to go back to a
> >>>> single MDS again.
> >>>>
> >>>> Apparently this worked for some of the clients. We tried to enable
> >>>> multi-MDS again to bring the faulty clients back, but no luck this
> >>>> time, and some of them are hanging and can't access the ceph fs.
> >>>>
> >>>> This is what we have in kern.log:
> >>>>
> >>>> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> >>>> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> >>>> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> >>>>
> >>>> Not sure what else we can try to bring the hanging clients back
> >>>> without rebooting, as they're in production and rebooting is not
> >>>> an option.
> >>>>
> >>>> Does anyone know how we can deal with this, please?
> >>>>
> >>>> Thanks
> >>>>
> >>>> Jaime
>
> --
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | jaime@xxxxxxxxxxxx
> Tel: +353-1-896-3725

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
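A closing note on the ceph-fuse fallback discussed in the thread -- a
minimal sketch, with assumed paths and hosts (/mnt/cephfs as the stuck
kernel mount, mon1 as a monitor host; the client id 31059144 is taken
from the log excerpts above):

    # see which processes still hold the kernel mount (e.g. the virsh users)
    fuser -vm /mnt/cephfs
    # lazy unmount: detach the mount now, clean up when the last user exits
    umount -l /mnt/cephfs
    # mount via FUSE on a separate mount point instead of the kernel client
    ceph-fuse -m mon1:6789 /mnt/cephfs-fuse
    # if a session stays wedged on the MDS side, evict it by client id
    # (disruptive: the client gets I/O errors until it remounts)
    ceph tell mds.1 client evict id=31059144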