On Wed, Apr 1, 2015 at 5:03 AM, Sylvain Munaut
<s.munaut@xxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>
> For some unknown reason, periodically, the master is kicked out and
> another one becomes leader. And then a couple second later, the
> original master calls for re-election and becomes leader again.
>
> This also seems to cause some load even after the original master is
> back. Here's a couple of graphs from the monitor at one such event :
>
> CPU load: http://i.imgur.com/7byRYhL.png
> Memory: http://i.imgur.com/4I0iE0l.png
>
> I raised the paxos debug to 5 and this is what happens on mon.a & mon.b :
>
> The short version just around the event: http://pastebin.com/h3AhHhHb
> The longer/full logs are available at http://ge.tt/2hMgZTD2
>
>
> Any explanation of what's happening and how to prevent it ?

Notice the "lease timeout" note in mon.b? It's unhappy because the
leader didn't update it recently enough that mon.b can keep serving
reads, so mon.b called an election on the presumption that mon.a died.
And indeed there's nothing in the log for mon.a between 17:49:32.77602
and 17:50:10.929258, which seems not great. I'd look and see if
something is happening with your disks, maybe?

Based on your graphs, actually, the CPU load you're seeing is probably
the cause, not the effect. An election can increase load some if a
bunch of client messages get piled up and need to be processed, but
otherwise it's just a couple messages and a hiccup in processing...
-Greg

>
>
> I can post more info on request. I'm also available on IRC ( nick
> 'tnt' ) for live debug if needed :p
>
>
> Cheers,
>
>     Sylvain Munaut
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
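
To hunt for other stalls like the silent stretch in mon.a's log between 17:49:32.77602 and 17:50:10.929258, something like the following might help. It is only a minimal sketch: it assumes every log line starts with a `YYYY-MM-DD HH:MM:SS.ffffff` timestamp, and the function name `find_log_gaps` and the 10-second default threshold are made up for illustration (they are not part of Ceph).

```python
from datetime import datetime


def find_log_gaps(lines, threshold_s=10.0):
    """Return (previous_ts, next_ts, gap_seconds) for each pair of
    consecutive timestamped lines that are more than threshold_s apart.

    Assumption: a line begins with "YYYY-MM-DD HH:MM:SS.ffffff".
    Lines that do not parse that way are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1],
                                   "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped line (e.g. a continuation line)
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps
```

Feeding it an open log file, e.g. `find_log_gaps(open(path))` with `path` pointing at the monitor log, gives a list of the quiet periods. Lining those up against disk and CPU graphs for the same moments should show whether the monitor was stalled before the election was called.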