Hi Wido,

I experienced the same problem almost half a year ago, and finally set this
value to 3 - no more wrong marks were given, except under extremely high disk
load, when an OSD really was down for a couple of seconds.

On Tue, May 7, 2013 at 4:59 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> I was just upgrading a 9 node, 36 OSD cluster running the next branch from
> some days ago to the Cuttlefish release.
>
> While rebooting the nodes one by one and waiting for active+clean on all
> PGs I noticed that some weird things happened.
>
> I reboot a node and see:
>
> "osdmap e580: 36 osds: 4 up, 36 in"
>
> After a few seconds I see all the OSDs reporting:
>
> osd.33 [WRN] map e582 wrongly marked me down
> osd.5 [WRN] map e582 wrongly marked me down
> osd.6 [WRN] map e582 wrongly marked me down
>
> I didn't check what was happening here, but it seems like the 4 OSDs that
> were shutting down reported everybody but themselves out (I should have
> printed ceph osd tree).
>
> Thinking about that, there are the following configuration options:
>
> OPTION(osd_min_down_reporters, OPT_INT, 1)
> OPTION(osd_min_down_reports, OPT_INT, 3)
>
> So if just one OSD sends 3 reports it can mark anybody in the cluster down,
> right?
>
> Shouldn't the best practice be to set osd_min_down_reporters to at least
> numosdperhost+1?
>
> In this case I have 4 OSDs per host, so shouldn't I use 5 here?
>
> This might as well be a bug, but it still doesn't seem right that all the
> OSDs on one machine can mark the whole cluster down.
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
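
For reference, the change on my side was just a ceph.conf tweak along these
lines. Treat it as a sketch only: I'm assuming the value I referred to above
is osd_min_down_reporters, and I put it in [global] so the monitors pick it
up; verify the section and option name against your version.

    [global]
        # assumption: "this value" above is osd_min_down_reporters.
        # Require failure reports from 3 distinct OSDs before one is marked
        # down; osd_min_down_reports is left at its default of 3.
        osd min down reporters = 3

With 4 OSDs per host, your numosdperhost+1 reasoning would indeed suggest 5
for the reporters value instead of my 3.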