Re: Monitor Restart triggers half of our OSDs marked down

Gregory Farnum <greg@xxxxxxxxxxx> · Tue, 3 Feb 2015 09:10:31 -0800

On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
<christian.eichelmann@xxxxxxxx> wrote:
> Hi all,
>
> During some failover and configuration tests, we discovered a strange
> phenomenon:
>
> Restarting one of our monitors (5 in total) triggers about 300 of the
> following events:
>
> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after
> 22.005858 >= grace 20.000000)
>
> The OSDs come back up shortly after they have been marked down. What I
> don't understand is: how can a restart of one monitor prevent the OSDs
> from talking to each other and get them marked down?
>
> FYI:
> We are currently using the following settings:
> mon osd adjust hearbeat grace = false
> mon osd min down reporters = 20
> mon osd adjust down out interval = false

That's really strange. I think maybe you're seeing some kind of
secondary effect; what kind of CPU usage are you seeing on the
monitors during this time? Have you checked the log on any OSDs which
have been marked down?
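
If it helps to quantify what happened before digging through individual
OSD logs, a rough sketch like the one below could tally the failure
events per OSD and show how far past the grace period each report was.
The regular expression is derived only from the single log line quoted
above, so other Ceph versions may format the message differently; the
function name summarize_failures and the sample text are hypothetical.

```python
import re
from collections import Counter

# Pattern based solely on the one log line quoted in this thread
# (assumption: other releases may word this message differently).
FAILED_RE = re.compile(
    r"(?P<osd>osd\.\d+) (?P<addr>\S+) failed "
    r"\((?P<reports>\d+) reports from (?P<peers>\d+) peers after "
    r"(?P<elapsed>[\d.]+) >= grace (?P<grace>[\d.]+)\)"
)

def summarize_failures(log_text):
    """Return (failure count per OSD, list of seconds past grace)."""
    # Lines pasted from mail may be wrapped, so collapse all whitespace
    # into single spaces before matching.
    flat = " ".join(log_text.split())
    per_osd = Counter()
    overshoot = []
    for m in FAILED_RE.finditer(flat):
        per_osd[m.group("osd")] += 1
        overshoot.append(float(m.group("elapsed")) - float(m.group("grace")))
    return per_osd, overshoot

# Sample taken from the quoted message, including its line wrap.
sample = """osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after
22.005858 >= grace 20.000000)"""

per_osd, overshoot = summarize_failures(sample)
print(per_osd, overshoot)
```

If the same handful of OSDs or hosts dominate the counts, that would
point at those machines rather than at the monitor restart itself.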

I have a suspicion that maybe the OSDs are detecting their failed
monitor connection and not being able to reconnect to another monitor
quickly enough, but I'm not certain what the overlaps are there.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com