I've noticed from time to time that my ceph-mds-a server will get stuck in a boot loop. I see log messages like this:

2013-11-17 20:42:42.476334 mon.0 [INF] osdmap e496905: 12 osds: 12 up, 12 in
2013-11-17 20:42:42.566744 mon.0 [INF] mds.? 192.168.1.20:6803/4047 up:boot
2013-11-17 20:42:42.566867 mon.0 [INF] mdsmap e488649: 1/1/1 up {0=dlceph01=up:active}, 2 up:standby
2013-11-17 20:42:42.644621 mon.0 [INF] pgmap v2247917: 2232 pgs: 2232 active+clean; 23785 MB data, 61294 MB used, 11053 GB / 11113 GB avail; 45180B/s wr, 0op/s
2013-11-17 20:42:43.295421 mon.0 [INF] osdmap e496906: 12 osds: 12 up, 12 in
2013-11-17 20:42:43.371436 mon.0 [INF] mds.? 192.168.1.20:6809/7263 up:boot
2013-11-17 20:42:43.371495 mon.0 [INF] mdsmap e488650: 1/1/1 up {0=dlceph01=up:active}, 2 up:standby
2013-11-17 20:42:43.475032 mon.0 [INF] pgmap v2247918: 2232 pgs: 2232 active+clean; 23785 MB data, 61294 MB used, 11053 GB / 11113 GB avail
2013-11-17 20:42:43.629813 mon.0 [INF] osdmap e496907: 12 osds: 12 up, 12 in
2013-11-17 20:42:43.697628 mon.0 [INF] mds.? 192.168.1.20:6804/26768 up:boot
2013-11-17 20:42:43.697700 mon.0 [INF] mdsmap e488651: 1/1/1 up {0=dlceph01=up:active}, 2 up:standby
2013-11-17 20:42:43.772643 mon.0 [INF] pgmap v2247919: 2232 pgs: 2232 active+clean; 23785 MB data, 61294 MB used, 11053 GB / 11113 GB avail
2013-11-17 20:42:44.866154 mon.0 [INF] pgmap v2247920: 2232 pgs: 2232 active+clean; 23785 MB data, 61294 MB used, 11053 GB / 11113 GB avail
2013-11-17 20:42:46.014768 mon.0 [INF] pgmap v2247921: 2232 pgs: 2232 active+clean; 23785 MB data, 61295 MB used, 11053 GB / 11113 GB avail
2013-11-17 20:42:46.484480 mon.0 [INF] osdmap e496908: 12 osds: 12 up, 12 in
2013-11-17 20:42:46.561228 mon.0 [INF] mds.? 192.168.1.20:6803/4047 up:boot
2013-11-17 20:42:46.561327 mon.0 [INF] mdsmap e488652: 1/1/1 up {0=dlceph01=up:active}, 2 up:standby
2013-11-17 20:42:46.653518 mon.0 [INF] pgmap v2247922: 2232 pgs: 2232 active+clean; 23785 MB data, 61296 MB used, 11053 GB / 11113 GB avail; 44045B/s wr, 0op/s

After watching it do that for a few minutes, with CephFS operations extremely slow, I kill -9 the ceph-mds processes on that host (see the sketch at the end of this message). A few seconds later I see a reconnect message and all is fine again.

Any idea why the MDS servers are doing this?

I'm running Ubuntu 13.04 x64 with the latest Dumpling release of Ceph. I have three MDS servers: one active and the other two standby.
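In case it helps, here is a rough sketch of the workaround I apply by hand. The hostname below is just a placeholder for the affected MDS host, so adjust it to your setup; this is only what I do to recover, not a proper fix.

  # watch the cluster log and confirm the mdsmap epoch keeps climbing
  # with mds.? ... up:boot entries
  ceph -w

  # check which MDS is active and which are standby
  ceph mds stat

  # on the affected host, kill the looping ceph-mds processes; a few
  # seconds later a standby (or the respawned daemon) reconnects
  ssh mds-host 'pkill -9 ceph-mds'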