Staggered failure of OSDs and mon_osd_down_out_subtree_limit


 



Hi,

I'm looking to implement an additional config setting that goes together with mon_osd_down_out_subtree_limit.

In this case I have 'mon_osd_down_out_subtree_limit' set to 'host' to prevent a whole host from being marked as out when it fails.
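
For reference, that is just this bit in ceph.conf:

    [mon]
    mon osd down out subtree limit = host

With this set, the monitor will not automatically mark out a whole subtree of type 'host' (or larger) when it goes down.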

I ran into a situation where not all OSDs failed at the same time, but staggered. The disk controller was having issues and one OSD after the other slowly started to fail. This meant they were not all marked as out within the same mon_osd_down_out_interval window (3600 seconds), but after it.

When the whole host fails at once, none of the OSDs are marked as out, which is the intended behaviour. With a staggered failure, however, the OSDs get marked out one by one; only the last OSD was not marked as out, since marking it would have taken the whole subtree out.

This is very easy to reproduce on VMs: just stop the OSDs one by one with an interval in between.

I am thinking of adding a new option: mon_osd_down_out_subtree_max_osd.

The default would be zero (disabled), but any value greater than zero would make the MON check whether there are already that many OSDs out in the same subtree before marking another one as out.

It would log a WRN message to clog saying it will not mark these OSDs as out since that would exceed the limit of out OSDs inside that subtree.
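
Roughly, the check I have in mind would look something like this. This is only an illustrative, self-contained C++ sketch; OsdState, count_out_in_subtree and may_mark_out are made-up stand-ins, not the real OSDMonitor code or data structures:

    #include <iostream>
    #include <map>
    #include <string>

    // Made-up stand-in for per-OSD monitor state; not a real Ceph structure.
    struct OsdState {
      std::string subtree;  // e.g. the host this OSD lives under
      bool up = true;
      bool out = false;
    };

    // The proposed option; 0 (the default) would disable the check.
    // Set to 2 here for the demonstration below.
    static const int mon_osd_down_out_subtree_max_osd = 2;

    // Count OSDs already marked out in the given subtree.
    static int count_out_in_subtree(const std::map<int, OsdState>& osds,
                                    const std::string& subtree) {
      int n = 0;
      for (const auto& [id, st] : osds)
        if (st.subtree == subtree && st.out)
          n++;
      return n;
    }

    // Decide whether the MON may automatically mark this down OSD as out.
    static bool may_mark_out(const std::map<int, OsdState>& osds, int osd_id) {
      const OsdState& st = osds.at(osd_id);
      int already_out = count_out_in_subtree(osds, st.subtree);
      if (mon_osd_down_out_subtree_max_osd > 0 &&
          already_out >= mon_osd_down_out_subtree_max_osd) {
        // In the monitor this would go to clog as a WRN message.
        std::cerr << "WRN: not marking osd." << osd_id << " out: subtree "
                  << st.subtree << " already has " << already_out
                  << " OSD(s) out (limit "
                  << mon_osd_down_out_subtree_max_osd << ")\n";
        return false;
      }
      return true;
    }

    int main() {
      std::map<int, OsdState> osds = {
        {0, {"host1", false, true}},   // already down and out
        {1, {"host1", false, true}},   // already down and out
        {2, {"host1", false, false}},  // just went down
      };
      if (may_mark_out(osds, 2))
        osds[2].out = true;
      return 0;
    }

With the limit set to 2 in the sketch, osd.2 stays marked in and the WRN message is emitted, which is exactly the behaviour I'd want on a host losing its disks one at a time.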

Does this sound like a sane thing to implement?

Wido


