Re: mon down for 3 hours after clocksource glitch

Hi Dan,

On Thu, 18 Jul 2013, Dan van der Ster wrote:
> Hi,
> Last night our cluster became unhealthy for 3 hours after one of the mons (a
> qemu-kvm VM) had this glitch:
> 
> Jul 18 00:12:43 andy03 kernel: Clocksource tsc unstable (delta =
> -60129537028 ns).  Enable clocksource failover by adding
> clocksource_failover kernel parameter.
> 
> shortly afterwards the mon.2 log said:
> 
> 2013-07-18 00:13:01.147770 7feca7a5d700  1 mon.2@1(probing).data_health(52)
> service_dispatch not in quorum -- drop message
> 2013-07-18 00:13:01.148622 7feca7a5d700  1 mon.2@1(synchronizing sync(
> requester state start )) e3 sync_obtain_latest_monmap
> 2013-07-18 00:13:01.148786 7feca7a5d700  1 mon.2@1(synchronizing sync(
> requester state start )) e3 sync_obtain_latest_monmap obtained monmap e3
> 2013-07-18 00:14:01.208569 7feca845e700  0 mon.2@1(synchronizing sync(
> requester state chunks )).data_health(52) update_stats avail 59% total
> 8547036 used 2993036 avail 5119824
> 2013-07-18 00:15:01.298686 7feca845e700  0 mon.2@1(synchronizing sync(
> requester state chunks )).data_health(52) update_stats avail 58% total
> 8547036 used 3074968 avail 5037892
> 2013-07-18 00:16:08.455941 7feca845e700  0 mon.2@1(synchronizing sync(
> requester state chunks )).data_health(52) update_stats avail 58% total
> 8547036 used 3147732 avail 4965128
> ...
> 
> and that continued for over three hours until:
> 
> 2013-07-18 03:32:33.991232 7f334ecd0700  0 mon.2@1(synchronizing sync(
> requester state chunks )).data_health(0) update_stats avail 56% total
> 8547036 used 3260064 avail 4852796
> 2013-07-18 03:33:34.314538 7f334ecd0700  0 mon.2@1(synchronizing sync(
> requester state chunks )).data_health(0) update_stats avail 56% total
> 8547036 used 3294832 avail 4818028
> 2013-07-18 03:34:05.285568 7f334e2cf700  0 log [INF] : mon.2 calling new
> monitor election
> 2013-07-18 03:34:05.285747 7f334e2cf700  1 mon.2@1(electing).elector(52)
> init, last seen epoch 52
> 
> In the meantime I tried restarting each ceph-mon daemon, but that didn't
> speed up the recovery.
>
> We are talking with our OpenStack team to understand if they can provide a
> more stable clocksource, but we wanted to better understand why Ceph was so
> sensitive to this glitch. It is good that mon.2 managed to recover
> eventually, but does anyone have an idea why it took 3 hours??!!

A couple of possibilities.  There are some sync improvements in 0.61.5 
(just out) that can prevent sync loops, which might explain things (it's 
hard to say from the log fragments above).  Were there sync start messages 
in that 3-hour interval, or just the update_stats messages?
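
If you still have the full mon.2 log around, something like this would 
tell you whether the sync kept restarting rather than making progress.  
Just a rough sketch, not part of Ceph -- the log path, the time window, 
and the marker strings are assumptions based on the fragments you pasted, 
so adjust to taste:

import re
from datetime import datetime

# Count sync-start markers in the mon.2 log between the glitch and the
# election.  A count that keeps climbing over the three hours would point
# at a sync-restart loop rather than one long, slow sync.
LOG = "/var/log/ceph/ceph-mon.2.log"          # assumed default path
START = datetime(2013, 7, 18, 0, 13, 0)
END = datetime(2013, 7, 18, 3, 34, 0)
MARKERS = ("sync_obtain_latest_monmap", "requester state start")

count = 0
with open(LOG) as f:
    for line in f:
        m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if START <= ts <= END and any(k in line for k in MARKERS):
            count += 1

print("sync start markers between %s and %s: %d" % (START, END, count))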

If the clock skew persists, the mons won't be able to hold a quorum, but 
that would only show up after the election.  A 60s skew is definitely 
enough to cause problems...
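
For context, the mons only tolerate a tiny amount of drift between each 
other (mon_clock_drift_allowed, 0.05s by default unless you've overridden 
it in ceph.conf), so the jump the kernel reported is roughly a thousand 
times past that limit.  Back-of-the-envelope, purely illustrative, using 
the delta from the kernel message above:

# Compare the reported clocksource jump against the monitors' allowed
# drift.  0.05s is the stock mon_clock_drift_allowed default (assumed
# unchanged here).
observed_delta_ns = 60129537028
mon_clock_drift_allowed = 0.05  # seconds

skew_s = observed_delta_ns / 1e9
print("skew %.1fs vs allowed %.2fs -> %.0fx over the limit"
      % (skew_s, mon_clock_drift_allowed, skew_s / mon_clock_drift_allowed))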

sage


> thx, Dan
> 
> -- 
> Dan van der Ster
> CERN IT-DSS
> 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
