Hi Stefan,

Sorry I couldn't get back to you sooner.

On Mon, May 27, 2019 at 5:02 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> Quoting Stefan Kooman (stefan@xxxxxx):
> > Hi Patrick,
> >
> > Quoting Stefan Kooman (stefan@xxxxxx):
> > > Quoting Stefan Kooman (stefan@xxxxxx):
> > > > Quoting Patrick Donnelly (pdonnell@xxxxxxxxxx):
> > > > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > > > somewhere it's not even outputting any log messages. If possible, it'd
> > > > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > > > if you're comfortable with gdb, a backtrace of any threads that look
> > > > > suspicious (e.g. not waiting on a futex) including `info threads`.
> > >
> > > Today the issue reappeared (after being absent for ~ 3 weeks). This time
> > > the standby MDS could take over and would not get into a deadlock
> > > itself. We made gdb traces again, which you can find over here:
> > >
> > > https://8n1.org/14011/d444
> >
> > We are still seeing these crashes occur ~ every 3 weeks or so. Have you
> > find the time to look into the backtraces / gdb dumps?
>
> We have not seen this issue anymore for the past three months. We have
> updated the cluster to 12.2.11 in the meantime, but not sure if that is
> related. Hopefully it stays away.

Looks like you hit the infinite loop bug in OpTracker. It was fixed in
12.2.11: https://tracker.ceph.com/issues/37977

The problem was introduced in 12.2.8.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat
Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com