Hi Wido,

I experienced the same problem almost half a year ago, and finally set this
value to 3 - no more wrong marks were given, except under extremely high disk
load, when an OSD really was down for a couple of seconds.

On Tue, May 7, 2013 at 4:59 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> I was just upgrading a 9 node, 36 OSD cluster running the next branch from
> some days ago to the Cuttlefish release.
>
> While rebooting the nodes one by one and waiting for active+clean on all
> PGs I noticed that some weird things happened.
>
> I reboot a node and see:
>
> "osdmap e580: 36 osds: 4 up, 36 in"
>
> After a few seconds I see all the OSDs reporting:
>
> osd.33 [WRN] map e582 wrongly marked me down
> osd.5 [WRN] map e582 wrongly marked me down
> osd.6 [WRN] map e582 wrongly marked me down
>
> I didn't check what was happening here, but it seems like the 4 OSDs that
> were shutting down reported everybody but themselves out (I should have
> printed ceph osd tree).
>
> Thinking about that, there are the following configuration options:
>
> OPTION(osd_min_down_reporters, OPT_INT, 1)
> OPTION(osd_min_down_reports, OPT_INT, 3)
>
> So if just one OSD sends 3 reports it can mark anybody in the cluster down,
> right?
>
> Shouldn't the best practice be to set osd_min_down_reporters to at least
> numosdperhost+1?
>
> In this case I have 4 OSDs per host, so shouldn't I use 5 here?
>
> This might as well be a bug, but it still doesn't seem right that all the
> OSDs on one machine can mark the whole cluster down.
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
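
For reference, the change on my side was just a ceph.conf tweak along these
lines. Treat it as a sketch only: I'm assuming the value I referred to above
is osd_min_down_reporters, and I put it in [global] so the monitors pick it
up; verify the section and option name against your version.

    [global]
        # assumption: "this value" above is osd_min_down_reporters.
        # Require failure reports from 3 distinct OSDs before one is marked
        # down; osd_min_down_reports is left at its default of 3.
        osd min down reporters = 3

With 4 OSDs per host, your numosdperhost+1 reasoning would indeed suggest 5
for the reporters value instead of my 3.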