On Thu, 18 Jul 2013, Dan van der Ster wrote:
> On Thu, Jul 18, 2013 at 6:27 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > Hi Dan,
> >
> > On Thu, 18 Jul 2013, Dan van der Ster wrote:
> >> Hi,
> >> Last night our cluster became unhealthy for 3 hours after one of the
> >> mons (a qemu-kvm VM) had this glitch:
> >>
> >> Jul 18 00:12:43 andy03 kernel: Clocksource tsc unstable (delta =
> >> -60129537028 ns). Enable clocksource failover by adding
> >> clocksource_failover kernel parameter.
> >>
> >> Shortly afterwards the mon.2 log said:
> >>
> >> 2013-07-18 00:13:01.147770 7feca7a5d700 1 mon.2@1(probing).data_health(52) service_dispatch not in quorum -- drop message
> >> 2013-07-18 00:13:01.148622 7feca7a5d700 1 mon.2@1(synchronizing sync( requester state start )) e3 sync_obtain_latest_monmap
> >> 2013-07-18 00:13:01.148786 7feca7a5d700 1 mon.2@1(synchronizing sync( requester state start )) e3 sync_obtain_latest_monmap obtained monmap e3
> >> 2013-07-18 00:14:01.208569 7feca845e700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(52) update_stats avail 59% total 8547036 used 2993036 avail 5119824
> >> 2013-07-18 00:15:01.298686 7feca845e700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(52) update_stats avail 58% total 8547036 used 3074968 avail 5037892
> >> 2013-07-18 00:16:08.455941 7feca845e700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(52) update_stats avail 58% total 8547036 used 3147732 avail 4965128
> >> [...]
> >>
> >> and that continued for over three hours until:
> >>
> >> 2013-07-18 03:32:33.991232 7f334ecd0700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(0) update_stats avail 56% total 8547036 used 3260064 avail 4852796
> >> 2013-07-18 03:33:34.314538 7f334ecd0700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(0) update_stats avail 56% total 8547036 used 3294832 avail 4818028
> >> 2013-07-18 03:34:05.285568 7f334e2cf700 0 log [INF] : mon.2 calling new monitor election
> >> 2013-07-18 03:34:05.285747 7f334e2cf700 1 mon.2@1(electing).elector(52) init, last seen epoch 52
> >>
> >> In the meantime I tried restarting each ceph-mon daemon, but that
> >> didn't speed up the recovery.
> >>
> >> We are talking with our OpenStack team to understand whether they can
> >> provide a more stable clocksource, but we wanted to better understand
> >> why Ceph was so sensitive to this glitch. It is good that mon.2
> >> managed to recover eventually, but does anyone have an idea why it
> >> took 3 hours??!!
> >
> > A couple of possibilities. There are some sync improvements in 0.61.5
> > (just out) that can prevent sync loops, which might explain things
> > (hard to say from the log fragments above). Were there sync start
> > messages in that 3 hour interval, or just the update_stats messages?
>
> Yes, many of those too. The whole set of relevant logs is here:
> http://pastebin.com/hCcu8Wm9
> The weird bit at 01:13:12 is me telnetting to the daemon to check the
> network, so just ignore that. Otherwise, the daemon was restarted twice
> (once after a server rebooted), but I left it alone after ~1:33.

Okay, this sounds exactly like the problem we just fixed in v0.61.5. The
trigger is that sync takes a while (largish mon data directory) and an
intervening election confuses it enough that it restarts right after it
finishes.
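If you want to double-check that from a log, counting the sync-start and
election messages over the window usually tells the story. A rough sketch in
Python -- the substrings are the ones in your excerpt and may differ between
versions, so treat it as a starting point rather than anything official:

#!/usr/bin/env python
# Rough sketch: count sync restarts and elections in a ceph-mon log.
# The matched substrings come from the log excerpt above and may differ
# between Ceph versions.
import sys
from collections import Counter

PATTERNS = {
    'sync start':      'requester state start',
    'monmap obtained': 'sync_obtain_latest_monmap obtained',
    'election called': 'calling new monitor election',
}

def summarize(path):
    counts = Counter()
    first = last = None
    with open(path) as log:
        for line in log:
            for label, needle in PATTERNS.items():
                if needle in line:
                    counts[label] += 1
                    # timestamp is the first two fields, e.g. "2013-07-18 00:13:01.148622"
                    stamp = ' '.join(line.split()[:2])
                    first = first or stamp
                    last = stamp
    return counts, first, last

if __name__ == '__main__':
    counts, first, last = summarize(sys.argv[1])
    print('window: %s .. %s' % (first, last))
    for label in PATTERNS:
        print('%-16s %d' % (label, counts[label]))

A long run of sync starts with elections mixed in over those three hours
would match the restart loop described above.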
> > If the clock skew persisted, the mons won't be able to hold a quorum,
> > but that would happen after the election.  A 60s skew is definitely
> > enough to cause problems...
>
> I can't say for certain if the skew persisted or not -- but I had the
> impression the clocks were ok shortly after the glitch. After the
> machine rebooted, ntp corrected the time by a bit -- not 60s:
>
> Jul 18 01:40:44 andy03 ntpd[4918]: time reset -0.775446 s
>
> It would be good to know whether the ceph-mon log is enough to explain
> the long outage (especially if the problem is fixed in 0.61.5). In any
> case, we're moving our mons to physical boxes soon, so the clock
> problem will hopefully go away.

Sounds good.  In this case it looks like sync was to blame. :)

sage
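P.S. If it helps while the mons are still on VMs: a quick-and-dirty way to
keep an eye on drift between the mon hosts is to compare their clocks over
ssh. Rough sketch only -- the host names are placeholders, 0.05 s mirrors the
default mon_clock_drift_allowed, and ntpq/ntpstat will give you more
trustworthy numbers:

#!/usr/bin/env python
# Rough sketch: compare each mon host's clock against the local clock.
# Host names are placeholders; DRIFT_ALLOWED mirrors the default
# mon_clock_drift_allowed (0.05 s). SSH round-trip time adds noise, so
# this is only a coarse check, not a substitute for ntpq -p.
import subprocess
import time

MON_HOSTS = ['mon-a', 'mon-b', 'mon-c']   # placeholder host names
DRIFT_ALLOWED = 0.05                      # seconds

def remote_time(host):
    # GNU date with nanosecond resolution on the remote host
    out = subprocess.check_output(['ssh', host, 'date', '+%s.%N'])
    return float(out.decode().strip())

for host in MON_HOSTS:
    delta = remote_time(host) - time.time()
    flag = 'OK' if abs(delta) < DRIFT_ALLOWED else 'SKEWED'
    print('%-10s offset %+.3f s  %s' % (host, delta, flag))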