On Mon, Oct 13, 2014 at 1:37 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> Hi Greg,
>
> I took down the interface with "ifconfig p7p1 down".
> I attached the config of the first monitor and the first osd.
> I created the cluster with ceph-deploy.
> The version is ceph version 0.86 (97dcc0539dfa7dac3de74852305d51580b7b1f82).
>
> On 13.10.2014 21:45, Gregory Farnum wrote:
>> How did you test taking down the connection?
>> What config options have you specified on the OSDs and in the monitor?
>>
>> None of the scenarios you're describing make much sense on a
>> semi-recent (post-dumpling-release) version of Ceph.
>
> Best Regards,
> martin

Hmm, do you have any logs? 120 seconds is just way longer than failure
detection should normally take, unless you've been playing with it enough
to stretch out the extra time the monitor waits to be certain.

But I did realize that in your configuration you probably want to set one
or both of mon_osd_min_down_reporters and mon_osd_min_down_reports to a
number greater than the number of OSDs you have on a single host. (They
default to 1 and 3, respectively.) That's probably how the disconnected
node managed to fail all of the other nodes: its failure reports reached
the monitor first.

You can also run tests with the mon_osd_adjust_heartbeat_grace option set
to false, to get more predictable results.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
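
As a concrete sketch of the suggestion above, for a hypothetical layout with
three OSDs per host the [mon] section of ceph.conf might look like this; the
option names are the ones mentioned in the message, the values are only
illustrative and should be adapted to the actual per-host OSD count:

    [mon]
        # Require down reports from more OSDs than live on a single host
        # (assuming 3 OSDs per host here), so one isolated node cannot
        # fail all of its peers on its own.
        mon_osd_min_down_reporters = 4
        mon_osd_min_down_reports = 4

        # Disable the adaptive heartbeat grace for more predictable
        # failure-detection timing during tests.
        mon_osd_adjust_heartbeat_grace = false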