On Fri, 15 Jun 2012, Amon Ott wrote:
> Hello all,
>
> I have seen this for a long time, but never investigated further. After
> stable test runs for several days, this is our last known show stopper
> before using Ceph in production. We are running 0.47.2 on 32-bit.
>
> If we restart MDS (or all ceph daemons) on all nodes, one after another or
> all together, they first recover and then the active one starts to spin
> with full CPU and does not answer any more. After a while, the next takes
> over, starts to spin, etc., until the whole cluster is unusable. This is
> completely reproducible and happens even without any active client.
>
> As expected, ceph -w shows lots of
> "2012-06-15 11:35:28.588775 mds e959: 1/1/1 up {0=3=up:active(laggy or crashed)}"
>
> It does not help to stop all services on all nodes for minutes or longer
> and to restart them - MDS will restart spinning. But: If we reboot the
> whole cluster, everything goes back to work.
>
> Today's MDS log is available at
> https://download.m-privacy.de/homeuser-mds.0.log.gz
>
> Is this a known problem? It has been with us for a looong time now, but
> since rebooting used to help, we never tracked it down.

I haven't seen this before. Can you attach to the spinning process with gdb
and send us a dump of what the threads are doing? 'thread apply all bt'.

I opened #2596: http://tracker.newdream.net/issues/2596

Thanks!
sage

>
> Amon Ott
> --
> Dr. Amon Ott
> m-privacy GmbH           Tel: +49 30 24342334
> Am Köllnischen Park 1    Fax: +49 30 24342336
> 10179 Berlin             http://www.m-privacy.de
>
> Amtsgericht Charlottenburg, HRB 84946
>
> Geschäftsführer:
>  Dipl.-Kfm. Holger Maczkowsky,
>  Roman Maczkowsky
>
> GnuPG-Key-ID: 0x2DD3A649
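
For reference, the backtrace request above (attach gdb to the spinning MDS and run 'thread apply all bt') can also be done non-interactively. The following is only a rough sketch, not an official Ceph procedure: the helper names build_gdb_argv and dump_backtraces are made up for illustration, and the PID of the spinning daemon has to be looked up separately (for example with ps). It relies on gdb's standard -p, -batch and -ex options.

```python
import subprocess

# The gdb command requested in the reply above.
GDB_COMMAND = "thread apply all bt"


def build_gdb_argv(pid):
    """Return the argv for a one-shot gdb run that prints all thread backtraces.

    Hypothetical helper: pid is the process ID of the spinning daemon.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    # -p attaches to the running process, -batch makes gdb exit once the
    # commands have run, -ex executes the given gdb command.
    return ["gdb", "-p", str(pid), "-batch", "-ex", GDB_COMMAND]


def dump_backtraces(pid, out_path):
    """Run gdb against pid and write everything it prints to out_path."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            build_gdb_argv(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,  # gdb may return non-zero; keep whatever output exists
        )
    return result.returncode
```

Usage would be something like dump_backtraces(pid_of_mds, "mds-threads.txt"), with the resulting file attached to the tracker issue. Attaching normally needs root or the same user as the daemon, and the traces are far more readable if debug symbols for the daemon are installed.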