Re: "1 hosts down" health warning?

Sage Weil <sweil@xxxxxxxxxx> · Tue, 10 May 2016 08:11:37 -0400 (EDT)

On Tue, 10 May 2016, Wido den Hollander wrote:
> > On 9 May 2016 at 18:36, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > 
> > 
> > We have a feature (mon_osd_reporter_subtree_level = host) that makes it so 
> > that if an entire host is down (or whatever the configured hierarchy 
> > level is) the osds aren't automatically marked out after 5 minutes.
> > 
> > This is confusing on an actual cluster because you see something like
> > 
> >             48/5661 in osds are down
> > 
> > but it never clears.  It's not until you look at the ceph osd tree output
> > that you can see why they aren't getting marked out.
> > 
> > It would be great if the health warning said something like
> > 
> >             48/5661 in osds are down
> >             1/142 hosts are down (accounting for 48/48 down osds)
> > 
> > and the health detail said something like
> > 
> >   host foo is down with 48 OSDs
> > 
> > I think this would be pretty easy to implement given the mon is 
> > already doing the subtree-based checks.
> > 
> > Thoughts? Any takers?
> 
> Seems like a good thing to have. I wouldn't say 'host', since 
> 'mon_osd_reporter_subtree_level' could be set to rack or row if you want 
> to.
> 
> Maybe:
> 
>             480/6720 in osds are down
>             1/14 of CRUSH type 'rack' are down (accounting for 480/480 down osds)

Yeah.  I was thinking it'd be

             1/14 ${type}s are down (accounting for 480/480 down osds)

e.g.,

             1/14 racks are down (accounting for 480/480 down osds)

just because concise is usually better.
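
A rough sketch of how the summary and detail lines could be derived, in
Python rather than the mon's C++, purely for illustration. The function
name down_summary and the (osd_id, bucket, is_up, is_in) tuple layout are
made up here and are not mon code. It assumes each OSD maps to exactly one
bucket of the configured type, and it ignores buckets that have no "in"
OSDs:

```python
from collections import defaultdict


def down_summary(osds, type_name):
    """Build health summary/detail lines for fully-down CRUSH buckets.

    osds: iterable of (osd_id, bucket_name, is_up, is_in) tuples
          (hypothetical layout, for illustration only).
    type_name: CRUSH type of the buckets, e.g. "host" or "rack".
    """
    # Collect the up/down state of every "in" OSD, grouped by bucket.
    by_bucket = defaultdict(list)
    for _osd_id, bucket, is_up, is_in in osds:
        if is_in:
            by_bucket[bucket].append(is_up)

    total_in = sum(len(states) for states in by_bucket.values())
    down_in = sum(states.count(False) for states in by_bucket.values())

    # A bucket counts as down only if none of its "in" OSDs is up.
    down_buckets = {
        bucket: len(states)
        for bucket, states in by_bucket.items()
        if not any(states)
    }
    accounted = sum(down_buckets.values())

    summary = [f"{down_in}/{total_in} in osds are down"]
    if down_buckets:
        summary.append(
            f"{len(down_buckets)}/{len(by_bucket)} {type_name}s are down "
            f"(accounting for {accounted}/{down_in} down osds)"
        )
    detail = [
        f"{type_name} {bucket} is down with {count} OSDs"
        for bucket, count in sorted(down_buckets.items())
    ]
    return summary, detail
```

The naive "${type}s" plural is kept as-is; it works for host, rack and
row, but other type names might need special handling.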

sage