I've heard of the same(?) problem on another cluster; they upgraded from
12.2.7 to 12.2.10 and suddenly got problems with their CephFS (and only
with the CephFS). However, they downgraded the MDS to 12.2.8 before I
could take a look at it, so I'm not sure what caused the issue. 12.2.8
works fine with the same workload, which also involves a relatively
large number of files.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Jan 20, 2019 at 3:26 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This
> didn't happen before then.
>
> --
> Adam
>
> On Sat, Jan 19, 2019, 20:17 Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
>>
>> Did this only start to happen after upgrading to 12.2.10?
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart <mozes@xxxxxxx> wrote:
>> >
>> > It worked for about a week, and then it seems to have locked up again.
>> >
>> > Here is the backtrace from the threads on the MDS:
>> > http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
>> >
>> > --
>> > Adam
>> >
>> > On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> > >
>> > > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart <mozes@xxxxxxx> wrote:
>> > > >
>> > > > Restarting the nodes causes the hanging again. This means that this
>> > > > is workload-dependent and not a transient state.
>> > > >
>> > > > I believe I've tracked down what is happening. One user was running
>> > > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
>> > > > wondering if, while the cluster was getting ready to fragment the
>> > > > directory, something freaked out, perhaps because it could not get
>> > > > all the caps back from the nodes (if that is even required).
>> > > >
>> > > > I've stopped that user's jobs for the time being, and will probably
>> > > > address it with them on Monday. If this is the issue, can I tell the
>> > > > MDS to pre-fragment the directory before I re-enable their jobs?
>> > > >
>> > >
>> > > The log shows the MDS is in a busy loop, but it doesn't show where.
>> > > If it happens again, please use gdb to attach to ceph-mds, then type
>> > > 'set logging on' and 'thread apply all bt' inside gdb, and send the
>> > > output to us.
>> > >
>> > > Yan, Zheng
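
For reference, the attach-and-backtrace procedure described above looks
roughly like this (a sketch only; the output filename is arbitrary, and
note that the MDS is paused while gdb is attached):

    # on the MDS host; assumes a single ceph-mds process is running
    gdb -p $(pidof ceph-mds)
    (gdb) set logging file mds-threads.txt   # write gdb output to this file
    (gdb) set logging on
    (gdb) thread apply all bt                # backtrace of every thread
    (gdb) set logging off
    (gdb) detach                             # let the MDS continue
    (gdb) quit

Installing the ceph debuginfo package beforehand should make the
backtraces readable.
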
>> > > > --
>> > > > Adam
>> > > >
>> > > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart <mozes@xxxxxxx> wrote:
>> > > > >
>> > > > > On a hunch, I shut down the compute nodes for our HPC cluster, and
>> > > > > 10 minutes after that restarted the MDS daemon. It replayed the
>> > > > > journal, evicted the dead compute nodes and is working again.
>> > > > >
>> > > > > This leads me to believe there was a broken transaction of some
>> > > > > kind coming from the compute nodes (which are also all running
>> > > > > CentOS 7.6 and using the kernel CephFS mount). I hope there is
>> > > > > enough logging from before to track this issue down.
>> > > > >
>> > > > > We are back up and running for the moment.
>> > > > > --
>> > > > > Adam
>> > > > >
>> > > > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <mozes@xxxxxxx> wrote:
>> > > > > >
>> > > > > > Hello all,
>> > > > > >
>> > > > > > I've got a 31-machine Ceph cluster running Ceph 12.2.10 and
>> > > > > > CentOS 7.6. We're using CephFS and RBD.
>> > > > > >
>> > > > > > Last night, one of our two active/active MDS servers went laggy,
>> > > > > > and after a restart, as soon as it goes active it immediately
>> > > > > > goes laggy again.
>> > > > > >
>> > > > > > I've got a log available here (debug_mds 20, debug_objecter 20):
>> > > > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>> > > > > >
>> > > > > > It looks like I might not have the right log levels. Thoughts on
>> > > > > > debugging this?
>> > > > > >
>> > > > > > --
>> > > > > > Adam
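
Regarding the log levels: they can be raised on a live MDS without a
restart. A rough sketch (the daemon id is a placeholder, and adding
debug_ms on top of the levels mentioned above is an assumption):

    # via the admin socket on the MDS host
    ceph daemon mds.<id> config set debug_mds 20
    ceph daemon mds.<id> config set debug_objecter 20
    ceph daemon mds.<id> config set debug_ms 1

    # or remotely, from any node with an admin keyring
    ceph tell mds.<id> injectargs '--debug_mds 20 --debug_objecter 20 --debug_ms 1'

Remember to drop them back down afterwards; debug_mds 20 generates a lot
of log volume.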