On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart <mozes@xxxxxxx> wrote:
>
> Restarting the nodes causes the hanging again. This means that this is
> workload dependent and not a transient state.
>
> I believe I've tracked down what is happening. One user was running
> 1500-2000 jobs in a single directory with 92000+ files in it. I am
> wondering if, as the cluster was getting ready to fragment the directory,
> something freaked out, perhaps because it was not able to get all the
> caps back from the nodes (if that is even required).
>
> I've stopped that user's jobs for the time being, and will probably
> address it with them Monday. If it is the issue, can I tell the mds to
> pre-fragment the directory before I re-enable their jobs?
>

The log shows the mds is in a busy loop, but doesn't show where it is. If
it happens again, please use gdb to attach to ceph-mds, then type 'set
logging on' and 'thread apply all bt' inside gdb, and send the output to us.

Yan, Zheng

> --
> Adam
>
> On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart <mozes@xxxxxxx> wrote:
> >
> > On a hunch, I shut down the compute nodes for our HPC cluster, and 10
> > minutes after that restarted the mds daemon. It replayed the journal,
> > evicted the dead compute nodes, and is working again.
> >
> > This leads me to believe there was a broken transaction of some kind
> > coming from the compute nodes (all also running CentOS 7.6 and using
> > the kernel cephfs mount). I hope there is enough logging from before
> > to try to track this issue down.
> >
> > We are back up and running for the moment.
> > --
> > Adam
> >
> > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <mozes@xxxxxxx> wrote:
> > >
> > > Hello all,
> > >
> > > I've got a 31-machine Ceph cluster running Ceph 12.2.10 and CentOS 7.6.
> > >
> > > We're using cephfs and rbd.
> > >
> > > Last night, one of our two active/active mds servers went laggy, and
> > > upon restart, once it goes active it immediately goes laggy again.
> > >
> > > I've got a log available here (debug_mds 20, debug_objecter 20):
> > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
> > >
> > > It looks like I might not have the right log levels. Thoughts on
> > > debugging this?
> > >
> > > --
> > > Adam
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
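[Editor's note] The gdb steps Yan Zheng describes above (attach to the
ceph-mds process, enable logging, dump backtraces from every thread) can be
scripted as a non-interactive gdb session. This is a sketch, not part of the
original thread; the pidof lookup and the output path are assumptions to
adjust for your host:

```shell
# Attach gdb to the running ceph-mds and capture all thread backtraces,
# per the instructions in the reply above. Run as root on the MDS host.
# Assumes a single ceph-mds process; the log path is an example.
gdb -p "$(pidof ceph-mds)" <<'EOF'
set logging file /tmp/ceph-mds-backtrace.txt
set logging on
thread apply all bt
set logging off
detach
quit
EOF
# Backtraces are now in /tmp/ceph-mds-backtrace.txt, ready to send upstream.
```

Note that gdb stops the process while attached, so the mds will appear
frozen for the duration of the dump.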
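[Editor's note] On the pre-fragmentation question raised above: Luminous
(12.2.x) MDS daemons expose `dirfrag` admin commands that should allow
inspecting and splitting a directory's fragments by hand before the jobs are
re-enabled. A hedged sketch, assuming rank 0 and an example path (verify the
exact syntax on your version with `ceph tell mds.0 help` first):

```shell
# List the current fragments of the large directory (example path):
ceph tell mds.0 dirfrag ls /homes/user/jobdir

# Split the root fragment (0/0) by 1 bit, i.e. into 2 fragments; repeat
# with higher bit counts or on child frags for deeper fragmentation:
ceph tell mds.0 dirfrag split /homes/user/jobdir 0/0 1
```

Manual splits like this sidestep the load-triggered split that the thread
suspects of wedging the mds, but the frag argument and bit count here are
illustrative, not taken from the original discussion.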