Hi,

> And indeed there's nothing in the log for mon.a between 17:49:32.77602
> and 17:50:10.929258, which seems not great. I'd look and see if
> something is happening with your disks, maybe?

Mmm, indeed. I had checked all the disks with SMART, and the RAID
controller wasn't reporting any of them as failed, but after digging
deeper I found a log showing that one of the two disks in the RAID-1
that stores the monitor data has had quite a few "aborted command"
errors.

I just swapped that disk; I'll see whether this completely fixes the
issue.

> Based on your graphs, actually, the CPU load you're seeing is probably
> the cause, not the effect. An election can increase load some if a
> bunch of client messages get piled up and need to be processed, but
> otherwise it's just a couple messages and a hiccup in processing...

CPU is definitely not the cause. If you look at the log of mon.b when
it becomes leader, you get things like:

2015-03-28 17:49:58.221540 7fc1e9fed700 5 mon.b@1(leader).osd e19301 send_incremental [19286..19301] to osd.8 10.208.2.213:6814/15970
2015-03-28 17:49:58.221551 7fc1e9fed700 5 mon.b@1(leader).osd e19301 send_latest to osd.8 10.208.2.213:6814/15970 start 19286
2015-03-28 17:49:58.221554 7fc1e9fed700 5 mon.b@1(leader).osd e19301 send_incremental [19286..19301] to osd.8 10.208.2.213:6814/15970

And you get _a_lot_ of these: something like 23000 of them emitted in
less than 250 microseconds.

Cheers,

    Sylvain
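
For anyone wanting to check their own monitor logs for the same kind
of burst, here is a rough sketch of how the lines could be counted. It
assumes the log lines start with a timestamp in the same format as the
excerpt above; the function name burst_stats and its needle parameter
are made up for illustration and are not part of any Ceph tool.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt above,
# e.g. "2015-03-28 17:49:58.221540" (assumed to start every line).
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ")


def burst_stats(lines, needle="send_incremental"):
    """Return (count, span_seconds) for the log lines containing `needle`.

    span_seconds is the time between the earliest and the latest
    matching line; lines without a parsable timestamp are skipped.
    """
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        m = TS_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return 0, 0.0
    span = (max(stamps) - min(stamps)).total_seconds()
    return len(stamps), span


# The three lines quoted in this mail, used as a tiny sample.
sample = [
    "2015-03-28 17:49:58.221540 7fc1e9fed700 5 mon.b@1(leader).osd e19301 "
    "send_incremental [19286..19301] to osd.8 10.208.2.213:6814/15970",
    "2015-03-28 17:49:58.221551 7fc1e9fed700 5 mon.b@1(leader).osd e19301 "
    "send_latest to osd.8 10.208.2.213:6814/15970 start 19286",
    "2015-03-28 17:49:58.221554 7fc1e9fed700 5 mon.b@1(leader).osd e19301 "
    "send_incremental [19286..19301] to osd.8 10.208.2.213:6814/15970",
]

count, span = burst_stats(sample)
print(f"{count} matching lines over {span * 1e6:.0f} microseconds")
```

To run it against a real log, pass an open file object instead of the
sample list, since the function only iterates over lines.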