Re: "1 hosts down" health warning?

Wido den Hollander <wido@xxxxxxxx> · Tue, 10 May 2016 08:05:50 +0200 (CEST)

> On 9 May 2016 at 18:36, Sage Weil <sweil@xxxxxxxxxx> wrote:
> 
> 
> We have a feature (mon_osd_reporter_subtree_level = host) that makes it so 
> that if an entire host is down (or whatever the configured hierarchy 
> level is) the osds aren't automatically marked out after 5 minutes.
> 
> This is confusing on an actual cluster because you see something like
> 
>             48/5661 in osds are down
> 
> but it never clears.  It's not until you look at the ceph osd tree output 
> that you can see why they aren't getting marked out.
> 
> It would be great if the health warning said something like
> 
>             48/5661 in osds are down
>             1/142 hosts are down (accounting for 48/48 down osds)
> 
> and the health detail said something like
> 
>   host foo is down with 48 OSDs
> 
> I think this would be pretty easy to implement given the mon is 
> already doing the subtree-based checks.
> 
> Thoughts? Any takers?

Seems like a good thing to have. I wouldn't hard-code 'host' in the message, though, since 'mon_osd_reporter_subtree_level' could also be set to rack or row.

Maybe:

            480/6720 in osds are down
            1/14 of CRUSH type 'rack' are down (accounting for 480/480 down osds)

That seems more logical to me.
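
To make the wording concrete, here is a rough Python sketch of how such a summary could be built. It is not the monitor code: the function name summarize_down_subtrees and the per-OSD dict layout ('id', 'up', 'in', 'buckets') are made up for illustration, and I assume a subtree only counts as down when every OSD under it is down.

```python
from collections import defaultdict


def summarize_down_subtrees(osds, bucket_type):
    """Build health summary/detail lines for fully-down CRUSH subtrees.

    osds: list of dicts with hypothetical keys 'id', 'up' (bool),
    'in' (bool) and 'buckets' (dict mapping CRUSH type -> bucket name).
    bucket_type: the CRUSH type to report on, e.g. 'host' or 'rack'.
    """
    # Group OSDs by the bucket of the requested type that contains them.
    members = defaultdict(list)
    for osd in osds:
        members[osd["buckets"][bucket_type]].append(osd)

    in_osds = [o for o in osds if o["in"]]
    down_in = [o for o in in_osds if not o["up"]]

    # Assumption: a bucket is "down" only if every OSD under it is down.
    down_buckets = {
        name: group
        for name, group in members.items()
        if all(not o["up"] for o in group)
    }
    # How many of the down+in OSDs are explained by those buckets.
    accounted = sum(
        1 for o in down_in if o["buckets"][bucket_type] in down_buckets
    )

    summary = []
    if down_in:
        summary.append(f"{len(down_in)}/{len(in_osds)} in osds are down")
    if down_buckets:
        summary.append(
            f"{len(down_buckets)}/{len(members)} of CRUSH type "
            f"'{bucket_type}' are down "
            f"(accounting for {accounted}/{len(down_in)} down osds)"
        )
    detail = [
        f"{bucket_type} {name} is down with {len(group)} OSDs"
        for name, group in sorted(down_buckets.items())
    ]
    return summary, detail


# Made-up cluster: 14 racks of 480 OSDs each, with rack0 entirely down.
osds = [
    {"id": r * 480 + i, "up": r != 0, "in": True,
     "buckets": {"rack": f"rack{r}"}}
    for r in range(14)
    for i in range(480)
]
summary, detail = summarize_down_subtrees(osds, "rack")
print("\n".join(summary + detail))
```

With the made-up cluster above this prints the two summary lines from my example plus one detail line for rack0. Passing the CRUSH type in as a parameter is the point: the same code produces the 'host' wording or the 'rack' wording depending on how the subtree level is configured.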

Wido

> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html