Hi,
I was just upgrading a 9-node, 36-OSD cluster running the next branch from a few days ago to the Cuttlefish release.
While rebooting the nodes one by one and waiting for all PGs to become active+clean, I noticed that some weird things happened.
I reboot a node and see:
"osdmap e580: 36 osds: 4 up, 36 in"
After a few seconds I see all the OSDs reporting:
osd.33 [WRN] map e582 wrongly marked me down
osd.5 [WRN] map e582 wrongly marked me down
osd.6 [WRN] map e582 wrongly marked me down
I didn't check what was happening here, but it seems the 4 OSDs that were shutting down reported everybody but themselves as down (I should have captured the output of ceph osd tree).
Thinking about that, there are the following configuration options:
OPTION(osd_min_down_reporters, OPT_INT, 1)
OPTION(osd_min_down_reports, OPT_INT, 3)
So if just one OSD sends 3 reports it can mark anybody in the cluster down, right?
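To make the question concrete, here is a minimal sketch of how I read those two thresholds. It is a simplified model, not the monitor's actual code: the function name would_mark_down is made up for illustration, and it ignores any timing or grace-period checks.

```python
from collections import Counter


def would_mark_down(reports, min_reporters=1, min_reports=3):
    """Simplified model of the two thresholds (hypothetical helper).

    `reports` holds one reporter OSD id per failure report received
    against a single target OSD. The defaults mirror the OPTION lines
    quoted above. Timing and grace periods are deliberately ignored.
    """
    per_reporter = Counter(reports)
    distinct_reporters = len(per_reporter)
    total_reports = sum(per_reporter.values())
    return distinct_reporters >= min_reporters and total_reports >= min_reports


# One OSD (id 7) sending three reports is enough with the defaults.
print(would_mark_down([7, 7, 7]))
# Four OSDs on the same host cannot reach a reporter threshold of 5.
print(would_mark_down([0, 1, 2, 3], min_reporters=5))
```

Under this reading, a single misbehaving reporter is sufficient with the defaults, which is what the question is getting at.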
Shouldn't the best practice be to set osd_min_down_reporters to at least numosdperhost+1?
In this case I have 4 OSDs per host, so shouldn't I use 5 here?
This might well be a bug, but it still doesn't seem right that all the OSDs on one machine can mark the whole cluster down.
I'm a little surprised that OSDs turning off could have marked anybody down at all. :/ Do you have any more info?
In any case, yeah, you probably want to increase your "reporters" required. That value is set at 1 so it works on a 2-node cluster. :)
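A rough way to see the trade-off: the reporter threshold has to stay reachable when a whole host goes away. The sketch below is a back-of-the-envelope check under loud assumptions: every host carries the same number of OSDs, any surviving OSD may act as a reporter (in practice only heartbeat peers report), and the 2-node case is taken to have one OSD per node. The function names are hypothetical.

```python
def max_surviving_reporters(hosts, osds_per_host):
    """Distinct OSDs left to report after every OSD on one host goes down.

    Assumes uniform hosts and that any surviving OSD can report.
    """
    return (hosts - 1) * osds_per_host


def threshold_is_reachable(hosts, osds_per_host, min_reporters):
    """True if enough OSDs survive a full host failure to meet the threshold."""
    return max_surviving_reporters(hosts, osds_per_host) >= min_reporters


# 2 nodes, 1 OSD each (assumed): a threshold of 1 works, 2 can never be met.
print(threshold_is_reachable(2, 1, 1))
print(threshold_is_reachable(2, 1, 2))
# 9 nodes, 4 OSDs each, threshold numosdperhost+1 = 5: 32 OSDs remain.
print(threshold_is_reachable(9, 4, 5))
```

So on the 9-node cluster from this thread a value of 5 leaves plenty of headroom, while the same numosdperhost+1 rule on a 2-node cluster could never be satisfied.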
-Greg
--
Software Engineer #42 @ http://inktank.com | http://ceph.com