Hi,
Last night our cluster became unhealthy for 3 hours after one of the mons (a qemu-kvm VM) had this glitch:
Jul 18 00:12:43 andy03 kernel: Clocksource tsc unstable (delta = -60129537028 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
Shortly afterwards, the mon.2 log said:
2013-07-18 00:13:01.147770 7feca7a5d700 1 mon.2@1(probing).data_health(52) service_dispatch not in quorum -- drop message
2013-07-18 00:13:01.148622 7feca7a5d700 1 mon.2@1(synchronizing sync( requester state start )) e3 sync_obtain_latest_monmap
2013-07-18 00:13:01.148786 7feca7a5d700 1 mon.2@1(synchronizing sync( requester state start )) e3 sync_obtain_latest_monmap obtained monmap e3
2013-07-18 00:14:01.208569 7feca845e700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(52) update_stats avail 59% total 8547036 used 2993036 avail 5119824
2013-07-18 00:15:01.298686 7feca845e700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(52) update_stats avail 58% total 8547036 used 3074968 avail 5037892
2013-07-18 00:16:08.455941 7feca845e700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(52) update_stats avail 58% total 8547036 used 3147732 avail 4965128
…
and that continued for over three hours until:
2013-07-18 03:32:33.991232 7f334ecd0700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(0) update_stats avail 56% total 8547036 used 3260064 avail 4852796
2013-07-18 03:33:34.314538 7f334ecd0700 0 mon.2@1(synchronizing sync( requester state chunks )).data_health(0) update_stats avail 56% total 8547036 used 3294832 avail 4818028
2013-07-18 03:34:05.285568 7f334e2cf700 0 log [INF] : mon.2 calling new monitor election
2013-07-18 03:34:05.285747 7f334e2cf700 1 mon.2@1(electing).elector(52) init, last seen epoch 52
In the meantime I tried restarting each ceph-mon daemon, but that didn't speed up the recovery.
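(In case it's useful to anyone looking at this, the mon's own view of its state can be queried via the admin socket while it is out of quorum, e.g. something like the following -- assuming the default socket path:

    ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok mon_status

which should report the "synchronizing" state seen in the log above.)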
We are talking with our OpenStack team to see whether they can provide a more stable clocksource, but we would also like to understand why Ceph was so sensitive to this glitch. It is good that mon.2 managed to recover eventually, but does anyone have an idea why it took three hours?
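(For reference, inside the guest the active and available clocksources can be checked with something like:

    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource

and if kvm-clock is listed it can presumably be pinned with clocksource=kvm-clock on the guest kernel command line -- though whether that avoids this kind of TSC glitch will depend on the guest kernel and hypervisor.)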
thx, Dan
--
Dan van der Ster
CERN IT-DSS