Re: MDS hangs in "heartbeat_map" deadlock

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Fri, 31 May 2019 12:47:35 -0700

Hi Stefan,

Sorry I couldn't get back to you sooner.

On Mon, May 27, 2019 at 5:02 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> Quoting Stefan Kooman (stefan@xxxxxx):
> > Hi Patrick,
> >
> > Quoting Stefan Kooman (stefan@xxxxxx):
> > > Quoting Stefan Kooman (stefan@xxxxxx):
> > > > Quoting Patrick Donnelly (pdonnell@xxxxxxxxxx):
> > > > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > > > somewhere it's not even outputting any log messages. If possible, it'd
> > > > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > > > if you're comfortable with gdb, a backtrace of any threads that look
> > > > > suspicious (e.g. not waiting on a futex) including `info threads`.
> > >
> > > Today the issue reappeared (after being absent for ~ 3 weeks). This time
> > > the standby MDS could take over and would not get into a deadlock
> > > itself. We made gdb traces again, which you can find over here:
> > >
> > > https://8n1.org/14011/d444
> >
> > We are still seeing these crashes occur ~ every 3 weeks or so. Have you
> > found the time to look into the backtraces / gdb dumps?
>
> We have not seen this issue anymore for the past three months. We have
> updated the cluster to 12.2.11 in the meantime, but not sure if that is
> related. Hopefully it stays away.

Looks like you hit the infinite loop bug in OpTracker. It was fixed in
12.2.11: https://tracker.ceph.com/issues/37977

The problem was introduced in 12.2.8.
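
If you want to check other clusters against that range, here is a minimal
sketch. It assumes you already have the bare version number as a string
(e.g. "12.2.10"); the helper names parse_version and is_affected are made up
for illustration and are not part of Ceph. It only covers the luminous
12.2.x window mentioned above and says nothing about other release series.

```python
# Affected window as stated in this thread: introduced in 12.2.8,
# fixed in 12.2.11. Only meaningful for the luminous 12.2.x series.
INTRODUCED = (12, 2, 8)
FIXED = (12, 2, 11)


def parse_version(text):
    """Turn '12.2.10' into (12, 2, 10) so versions compare numerically.

    Comparing the raw strings would sort '12.2.10' before '12.2.8'.
    """
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version_text):
    """Return True if the version falls in [12.2.8, 12.2.11)."""
    return INTRODUCED <= parse_version(version_text) < FIXED


for v in ("12.2.7", "12.2.8", "12.2.10", "12.2.11"):
    print(v, is_affected(v))
```

The tuple comparison is the only point of the sketch: 12.2.10 is inside the
window even though it sorts before 12.2.8 as a plain string.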

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com