On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Restarting the nodes causes the hanging again. This means that this is
> workload dependent and not a transient state.
>
> I believe I've tracked down what is happening. One user was running
> 1500-2000 jobs in a single directory with 92000+ files in it. I am
> wondering if, as the cluster was getting ready to fragment the directory,
> something freaked out, perhaps because it was not able to get all the
> caps back from the nodes (if that is even required).
>
> I've stopped that user's jobs for the time being, and will probably
> address it with them Monday. If it is the issue, can I tell the mds to
> pre-fragment the directory before I re-enable their jobs?
>

The log shows the mds is in a busy loop, but doesn't show where it is. If
it happens again, please use gdb to attach to ceph-mds, then type 'set
logging on' and 'thread apply all bt' inside gdb, and send the output to us.

Yan, Zheng

> --
> Adam
>
> On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart <mozes@xxxxxxx> wrote:
> >
> > On a hunch, I shut down the compute nodes for our HPC cluster, and 10
> > minutes after that restarted the mds daemon. It replayed the journal,
> > evicted the dead compute nodes, and is working again.
> >
> > This leads me to believe there was a broken transaction of some kind
> > coming from the compute nodes (all also running CentOS 7.6 and using
> > the kernel cephfs mount). I hope there is enough logging from before
> > to try to track this issue down.
> >
> > We are back up and running for the moment.
> > --
> > Adam
> >
> > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <mozes@xxxxxxx> wrote:
> > >
> > > Hello all,
> > >
> > > I've got a 31-machine Ceph cluster running Ceph 12.2.10 and CentOS 7.6.
> > >
> > > We're using cephfs and rbd.
> > >
> > > Last night, one of our two active/active mds servers went laggy, and
> > > upon restart, once it goes active it immediately goes laggy again.
> > >
> > > I've got a log available here (debug_mds 20, debug_objecter 20):
> > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
> > >
> > > It looks like I might not have the right log levels. Thoughts on
> > > debugging this?
> > >
> > > --
> > > Adam
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
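[Editor's note] The gdb steps Yan Zheng describes above (attach to the
ceph-mds process, enable logging, dump backtraces from every thread) can be
scripted as a non-interactive gdb session. This is a sketch, not part of the
original thread; the pidof lookup and the output path are assumptions to
adjust for your host:

```shell
# Attach gdb to the running ceph-mds and capture all thread backtraces,
# per the instructions in the reply above. Run as root on the MDS host.
# Assumes a single ceph-mds process; the log path is an example.
gdb -p "$(pidof ceph-mds)" <<'EOF'
set logging file /tmp/ceph-mds-backtrace.txt
set logging on
thread apply all bt
set logging off
detach
quit
EOF
# Backtraces are now in /tmp/ceph-mds-backtrace.txt, ready to send upstream.
```

Note that gdb stops the process while attached, so the mds will appear
frozen for the duration of the dump.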
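[Editor's note] On the pre-fragmentation question raised above: Luminous
(12.2.x) MDS daemons expose `dirfrag` admin commands that should allow
inspecting and splitting a directory's fragments by hand before the jobs are
re-enabled. A hedged sketch, assuming rank 0 and an example path (verify the
exact syntax on your version with `ceph tell mds.0 help` first):

```shell
# List the current fragments of the large directory (example path):
ceph tell mds.0 dirfrag ls /homes/user/jobdir

# Split the root fragment (0/0) by 1 bit, i.e. into 2 fragments; repeat
# with higher bit counts or on child frags for deeper fragmentation:
ceph tell mds.0 dirfrag split /homes/user/jobdir 0/0 1
```

Manual splits like this sidestep the load-triggered split that the thread
suspects of wedging the mds, but the frag argument and bit count here are
illustrative, not taken from the original discussion.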