Re: mon down for 3 hours after clocksource glitch

On Thu, 18 Jul 2013, Dan van der Ster wrote:
> On Thu, Jul 18, 2013 at 6:27 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > Hi Dan,
> >
> > On Thu, 18 Jul 2013, Dan van der Ster wrote:
> >> Hi,
> >> Last night our cluster became unhealthy for 3 hours after one of the mons (a
> >> qemu-kvm VM) had this glitch:
> >>
> >> Jul 18 00:12:43 andy03 kernel: Clocksource tsc unstable (delta =
> >> -60129537028 ns).  Enable clocksource failover by adding
> >> clocksource_failover kernel parameter.
> >>
> >> Shortly afterwards, the mon.2 log said:
> >>
> >> 2013-07-18 00:13:01.147770 7feca7a5d700  1 mon.2@1(probing).data_health(52)
> >> service_dispatch not in quorum -- drop message
> >> 2013-07-18 00:13:01.148622 7feca7a5d700  1 mon.2@1(synchronizing sync(
> >> requester state start )) e3 sync_obtain_latest_monmap
> >> 2013-07-18 00:13:01.148786 7feca7a5d700  1 mon.2@1(synchronizing sync(
> >> requester state start )) e3 sync_obtain_latest_monmap obtained monmap e3
> >> 2013-07-18 00:14:01.208569 7feca845e700  0 mon.2@1(synchronizing sync(
> >> requester state chunks )).data_health(52) update_stats avail 59% total
> >> 8547036 used 2993036 avail 5119824
> >> 2013-07-18 00:15:01.298686 7feca845e700  0 mon.2@1(synchronizing sync(
> >> requester state chunks )).data_health(52) update_stats avail 58% total
> >> 8547036 used 3074968 avail 5037892
> >> 2013-07-18 00:16:08.455941 7feca845e700  0 mon.2@1(synchronizing sync(
> >> requester state chunks )).data_health(52) update_stats avail 58% total
> >> 8547036 used 3147732 avail 4965128
> >> [...]
> >>
> >> and that continued for over three hours until:
> >>
> >> 2013-07-18 03:32:33.991232 7f334ecd0700  0 mon.2@1(synchronizing sync(
> >> requester state chunks )).data_health(0) update_stats avail 56% total
> >> 8547036 used 3260064 avail 4852796
> >> 2013-07-18 03:33:34.314538 7f334ecd0700  0 mon.2@1(synchronizing sync(
> >> requester state chunks )).data_health(0) update_stats avail 56% total
> >> 8547036 used 3294832 avail 4818028
> >> 2013-07-18 03:34:05.285568 7f334e2cf700  0 log [INF] : mon.2 calling new
> >> monitor election
> >> 2013-07-18 03:34:05.285747 7f334e2cf700  1 mon.2@1(electing).elector(52)
> >> init, last seen epoch 52
> >>
> >> In the meantime I tried restarting each ceph-mon daemon, but that didn't
> >> speed up the recovery.
> >>
> >> We are talking with our OpenStack team to understand if they can provide a
> >> more stable clocksource, but we wanted to better understand why Ceph was so
> >> sensitive to this glitch. It is good that mon.2 managed to recover
> >> eventually, but does anyone have an idea why it took 3 hours??!!
> >
> > A couple of possibilities.  There are some sync improvements in 0.61.5
> > (just out) that can prevent sync loops, which might explain things (hard
> > to say from the log fragments above).  Were there sync start messages in
> > that 3 hour interval, or just the update_stats messages?
> 
> Yes, many of those too. The whole set of relevant logs is here:
> http://pastebin.com/hCcu8Wm9
> The weird bit at 01:13:12 is me telnetting to the daemon to check the
> network, so just ignore that. Otherwise, the daemon was restarted
> twice (once after a server rebooted), but I left it alone after ~1:33.

Okay, this sounds exactly like the problem we just fixed in v0.61.5.  The 
trigger is that sync takes a while (largish mon data directory) and an 
intervening election confuses it enough that the sync restarts right after 
it finishes.
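Just to illustrate the dynamics (this is not Ceph's actual monitor code; 
the function, timings, and names below are made up): if one full sync pass 
takes longer than the gap between elections, the requester keeps getting 
knocked back to the start, and the outage stretches to hours even though a 
single pass would only take minutes.

    # Hypothetical model of a store sync that restarts whenever an election
    # interrupts it -- for illustration only, not Ceph's real implementation.

    def sync_completion_time(sync_duration_s, election_times_s, horizon_s):
        """Return when an uninterrupted sync pass finally completes, or None."""
        start = 0.0
        for t in sorted(election_times_s):
            # An election landing before the current pass finishes forces the
            # requester to start the sync over from scratch.
            if start <= t < start + sync_duration_s:
                start = t
            if start + sync_duration_s > horizon_s:
                return None  # never finished within the simulated window
        return start + sync_duration_s

    # One pass takes 10 minutes, but an election fires roughly every 8 minutes:
    # the sync only completes once the elections stop (a bit over 3 hours in).
    elections = [480.0 * i for i in range(1, 25)]
    print(sync_completion_time(600.0, elections, 4 * 3600.0))

That restart loop is the behaviour the sync improvements in 0.61.5 are meant 
to avoid.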

> > If the clock skew persisted, the mons wouldn't be able to hold a quorum, but
> > that would happen after the election.  A 60s skew is definitely enough to
> > cause problems...
> 
> I can't say for certain whether the skew persisted -- but I had the
> impression the clocks were OK shortly after the glitch. After the
> machine rebooted, ntpd corrected the time by a bit -- not 60s:
> 
> Jul 18 01:40:44 andy03 ntpd[4918]: time reset -0.775446 s
> 
> It would be good to know whether the ceph-mon log is enough to explain
> the long outage (especially whether this is fixed in 0.61.5). In any
> case, we're moving our mons to physical boxes soon, so the clock
> problem will hopefully go away.

Sounds good.  In this case it looks like sync was to blame.  :)
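For what it's worth, a quick way to check whether the cluster itself still 
thinks the mon clocks are skewed is to grep the health output. A rough 
sketch, assuming the ceph CLI is on the path and that the warning text 
contains "clock skew" (the exact wording can vary between releases):

    # Hypothetical helper: report monitor clock-skew warnings from the cluster.
    import subprocess

    def mons_with_clock_skew():
        out = subprocess.check_output(["ceph", "health", "detail"]).decode()
        return [line.strip() for line in out.splitlines()
                if "clock skew" in line.lower()]

    if __name__ == "__main__":
        warnings = mons_with_clock_skew()
        if warnings:
            print("clock skew reported by the cluster:")
            for w in warnings:
                print("  " + w)
        else:
            print("no clock skew warnings")

The mons start warning once the skew exceeds the 'mon clock drift allowed' 
threshold (0.05s by default), so a jump anywhere near 60s would show up 
there.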

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



