On a hunch, I shut down the compute nodes for our HPC cluster, and 10
minutes after that restarted the MDS daemon. It replayed the journal,
evicted the dead compute nodes, and is working again.

This leads me to believe there was a broken transaction of some kind
coming from the compute nodes (which are also all running CentOS 7.6
and using the kernel cephfs mount). I hope there is enough logging
from before to track this issue down. We are back up and running for
the moment.

--
Adam

On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Hello all,
>
> I've got a 31-machine Ceph cluster running Ceph 12.2.10 and CentOS 7.6.
> We're using cephfs and rbd.
>
> Last night, one of our two active/active MDS servers went laggy, and
> upon restart, once it goes active it immediately goes laggy again.
>
> I've got a log available here (debug_mds 20, debug_objecter 20):
> https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>
> It looks like I might not have the right log levels. Thoughts on
> debugging this?
>
> --
> Adam

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
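
P.S. In case it helps anyone hitting something similar, this is a
rough sketch of the commands involved; the MDS name/rank and the
client id below are placeholders, not values from our cluster:

    # raise MDS debug levels at runtime (injectargs works on 12.2.x)
    ceph tell mds.mds-a injectargs '--debug_mds 20 --debug_objecter 20'

    # or set them via the admin socket on the MDS host
    ceph daemon mds.mds-a config set debug_mds 20

    # list the client sessions the MDS is holding, then evict a
    # stale/dead client by its session id
    ceph daemon mds.mds-a session ls
    ceph tell mds.0 client evict id=4305

In our case simply stopping the suspect clients and restarting the
MDS was enough for it to replay the journal and evict the dead
sessions on its own.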