Re: Handling of network failures in the cluster network

On Mon, Oct 13, 2014 at 1:37 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> Hi Greg,
>
> I took down the interface with "ifconfig p7p1 down".
> I attached the config of the first monitor and the first osd.
> I created the cluster with ceph-deploy.
> The version is ceph version 0.86 (97dcc0539dfa7dac3de74852305d51580b7b1f82).
>
> On 13.10.2014 21:45, Gregory Farnum wrote:
>> How did you test taking down the connection?
>> What config options have you specified on the OSDs and in the monitor?
>>
>> None of the scenarios you're describing make much sense on a
>> semi-recent (post-dumpling-release) version of Ceph.
>
> Best Regards,
>  martin


Hmm, do you have any logs?
120 seconds is just way longer than the failure detection should
normally take, unless you've been playing with it enough to stretch
out the extra time the monitor waits to be certain.

But I did realize that in your configuration you probably want to set
one or both of mon_osd_min_down_reporters and mon_osd_min_down_reports
to a number greater than the number of OSDs you have on a single host.
(They default to 1 and 3, respectively.) That's probably how the
disconnected node managed to fail all of the other nodes: its
failure reports reached the monitor first.

You can also run tests with the mon_osd_adjust_heartbeat_grace option
set to false, to get more predictable results.
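That is, a sketch of the corresponding ceph.conf setting:

```ini
[mon]
# Disable the adaptive grace period so the monitor uses the fixed
# heartbeat grace, giving repeatable failure-detection timing in tests.
mon osd adjust heartbeat grace = false
```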
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




