"1 hosts down" health warning?

Sage Weil <sweil@xxxxxxxxxx> · Mon, 9 May 2016 12:36:52 -0400 (EDT)

We have a feature (mon_osd_down_out_subtree_limit = host) that makes it so 
that if an entire host is down (or whatever the configured hierarchy 
level is) the osds aren't automatically marked out after 5 minutes.
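
As a rough illustration of the rule above, here is a minimal Python sketch 
(not the mon's actual code). It assumes the limit level is host, ignores 
the down-out timer and flags such as noout, and uses a hypothetical 
function name and a plain dict from OSD id to host name.

```python
from collections import defaultdict


def osds_eligible_for_auto_out(osd_host, down_osds):
    """Return the down OSDs that could still be auto-marked out.

    osd_host:  dict mapping OSD id -> host name (hypothetical input shape)
    down_osds: set of OSD ids currently down

    Simplified model: if every OSD on a host is down, the whole host is
    treated as a down subtree and none of its OSDs are marked out.
    """
    by_host = defaultdict(set)
    for osd, host in osd_host.items():
        by_host[host].add(osd)

    eligible = set()
    for host, osds in by_host.items():
        down_here = osds & down_osds
        # Only partially-down hosts have their down OSDs marked out.
        if down_here and down_here != osds:
            eligible |= down_here
    return eligible
```

With hosts a = {0, 1} and b = {2, 3} and OSDs 0, 1, 2 down, only osd 2 
is eligible: host a is entirely down, so its OSDs are held back.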

This is confusing on an actual cluster because you see something like

            48/5661 in osds are down

but it never clears.  It's not until you look at the ceph osd tree output 
that you can see why they aren't getting marked out.

It would be great if the health warning said something like

            48/5661 in osds are down
            1/142 hosts are down (accounting for 48/48 down osds)

and the health detail said something like

  host foo is down with 48 OSDs
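
To make the proposed output concrete, here is a small Python sketch of how 
the summary and detail lines could be derived. It is only an illustration 
under assumptions: the function name and input shapes are hypothetical, a 
host counts as down when all of its OSDs are down, and the real 
implementation would live in the mon and walk the CRUSH hierarchy.

```python
from collections import defaultdict


def host_down_health(osd_host, in_osds, down_osds):
    """Build (summary, detail) lines in the format proposed above.

    osd_host:  dict mapping OSD id -> host name (hypothetical input shape)
    in_osds:   set of OSD ids that are "in"
    down_osds: set of OSD ids that are down
    """
    by_host = defaultdict(set)
    for osd, host in osd_host.items():
        by_host[host].add(osd)

    summary, detail = [], []
    down_in = in_osds & down_osds
    if not down_in:
        return summary, detail

    summary.append(f"{len(down_in)}/{len(in_osds)} in osds are down")

    # Assumption: a host is "down" when every OSD under it is down.
    down_hosts = {h: o for h, o in by_host.items() if o <= down_osds}
    if down_hosts:
        covered = set().union(*down_hosts.values()) & down_in
        summary.append(
            f"{len(down_hosts)}/{len(by_host)} hosts are down "
            f"(accounting for {len(covered)}/{len(down_in)} down osds)"
        )
        for host in sorted(down_hosts):
            detail.append(
                f"host {host} is down with {len(down_hosts[host])} OSDs"
            )
    return summary, detail
```

Calling it with a host foo holding 48 OSDs that are all down, plus one 
healthy host, yields a "1/2 hosts are down" summary line and the detail 
line "host foo is down with 48 OSDs".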

I think this would be pretty easy to implement given the mon is 
already doing the subtree-based checks.

Thoughts? Any takers?

sage