On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann <christian.eichelmann@xxxxxxxx> wrote:
> Hi all,
>
> during some failover and configuration tests, we are currently
> seeing a strange phenomenon:
>
> Restarting one of our monitors (5 in total) triggers about 300 of the
> following events:
>
> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after
> 22.005858 >= grace 20.000000)
>
> The OSDs come back up shortly after they have been marked down. What I
> don't understand is: how can a restart of one monitor prevent the OSDs
> from talking to each other and cause them to be marked down?
>
> FYI:
> We are currently using the following settings:
> mon osd adjust heartbeat grace = false
> mon osd min down reporters = 20
> mon osd adjust down out interval = false

That's really strange. I think maybe you're seeing some kind of
secondary effect; what kind of CPU usage are you seeing on the monitors
during this time? Have you checked the log on any OSDs which have been
marked down?

I have a suspicion that the OSDs are detecting their failed monitor
connection and are unable to reconnect to another monitor quickly
enough, but I'm not certain what the overlaps are there.
-Greg
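To see which OSDs were reported failed during the monitor restart, and by how much each report exceeded the grace period, the events could be pulled out of the cluster log with something like the following. This is only a minimal sketch: it assumes every failure event has exactly the format of the single line quoted above, and `parse_failure` and `summarize` are hypothetical helper names, not part of Ceph.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this thread (an assumption;
# other Ceph versions may format the message differently).
FAILED_RE = re.compile(
    r"(?P<osd>osd\.\d+) (?P<addr>\S+) failed "
    r"\((?P<reports>\d+) reports from (?P<peers>\d+) peers after "
    r"(?P<elapsed>\d+\.\d+) >= grace (?P<grace>\d+\.\d+)\)"
)


def parse_failure(line):
    """Return a dict describing one 'failed' event, or None if no match."""
    m = FAILED_RE.search(line)
    if m is None:
        return None
    return {
        "osd": m.group("osd"),
        "addr": m.group("addr"),
        "reports": int(m.group("reports")),
        "peers": int(m.group("peers")),
        "elapsed": float(m.group("elapsed")),
        "grace": float(m.group("grace")),
    }


def summarize(lines):
    """Parse all lines; return the matched events and a per-OSD event count."""
    events = [e for e in map(parse_failure, lines) if e is not None]
    per_osd = Counter(e["osd"] for e in events)
    return events, per_osd


if __name__ == "__main__":
    sample = (
        "osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers "
        "after 22.005858 >= grace 20.000000)"
    )
    events, per_osd = summarize([sample, "some unrelated log line"])
    for e in events:
        # How far past the grace period the reports were when the OSD failed
        overshoot = e["elapsed"] - e["grace"]
        print(e["osd"], e["reports"], "reports,", round(overshoot, 3), "s past grace")
    print(dict(per_osd))
```

Comparing the timestamps of these events with the monitor restart time, and with the logs of the affected OSDs, may help show whether the failures cluster around the moment the OSDs lose their monitor session.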