Re: health checks and logging

Sage Weil <sweil@xxxxxxxxxx> · Tue, 6 Jun 2017 21:29:10 +0000 (UTC)



On Mon, 5 Jun 2017, John Spray wrote:
> On Mon, Jun 5, 2017 at 10:21 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > I took a quick look at the get_health() methods in the Monitor after our
> > discussion this morning:
> >
> > - OSDMonitor::get_health() looks at the pool stats for a few things; I
> > think these can be safely/easily moved to PGMap::get_health() (so that
> > they will run in ceph-mgr)
> > - Then it'll be an easy change to calculate the health and detail sets in
> > encode_pending as each OSDMap is published.
> > - MgrStatMonitor is already persisting the mgr health messages.
> > - MDSMonitor is also strictly a function of the FSMap so it'd be easy to
> > move to encode_pending.
> > - Monitor::get_health() has some odds and ends we can either leave in
> > place or improve (e.g., time skew checks).  Not sure it matters much.
> >
> > My main question is whether you had specific thoughts about how to
> > identify warnings so that we can note when they appear and disappear.  We
> > can just go by the unique strings but then you'll see something like
> >
> >  1 osd(s) down
> >  ...
> >  1 osd(s) down cleared
> >  2 osd(s) down
> >  ...
> >
> > (or whatever we make the messages for cleared warnings look like).  Should
> > we associate a 'tag' for each message that is used to identify it, so
> > that, for example, "%d osd down" for any number of OSDs is considered the
> > "same" message and we log when it changes but don't say it has cleared?
> 
> Yes, exactly.
> 
> All the possible health warnings should each get a unique error
> code (tag, if you like), that would be a stable thing that we explain
> in the docs, like we do for the MDS health messages[1].  Adding the
> codes for health messages generally was one of the steps on my
> favorite tracker ticket[2] (it has aged like a fine wine).
> 
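To make the tag idea concrete, here is a rough sketch of the logging rule (plain Python, not Ceph code). It assumes each health check carries a stable code plus a free-form summary string; the function name diff_health, the code OSD_DOWN, and the log wording are all made up for illustration. A change in the summary under the same code is logged as an update, and "cleared" is only logged when the code disappears entirely.

```python
def diff_health(prev, curr):
    """Return log lines for the change between two health snapshots.

    prev and curr are dicts mapping a stable code (hypothetical, e.g.
    "OSD_DOWN") to the current human-readable summary for that check.
    """
    lines = []
    for code in sorted(curr):
        if code not in prev:
            # The check was not active before: it has just appeared.
            lines.append(f"Health check failed: {curr[code]} ({code})")
        elif prev[code] != curr[code]:
            # Same code, different text (e.g. 1 -> 2 OSDs down): an update,
            # not a clear followed by a new warning.
            lines.append(f"Health check update: {curr[code]} ({code})")
    for code in sorted(prev):
        if code not in curr:
            # The code is gone entirely: only now do we say "cleared".
            lines.append(f"Health check cleared: {code} (was: {prev[code]})")
    return lines


if __name__ == "__main__":
    snapshots = [
        {},
        {"OSD_DOWN": "1 osd(s) down"},
        {"OSD_DOWN": "2 osd(s) down"},
        {},
    ]
    for before, after in zip(snapshots, snapshots[1:]):
        for line in diff_health(before, after):
            print(line)
```

With this rule the "1 osd(s) down" to "2 osd(s) down" transition produces a single update line rather than a cleared/raised pair.
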
> We'll need to look at the interplay between this and other logging --
> in some cases, if we're e.g. already logging nice messages for OSDs
> going up and down, then we might not want to also have log messages
> redundantly printing the health state.  We might also want to get rid
> of the places that we echo the map summary on changes like this, or at
> least put them at a lower severity than what the operator sees by
> default.  Basically, when an OSD goes down, we should make sure there
> is one log message to that effect, rather than 2 or 3.
> 
> BTW, earlier we were talking about logging things at a host/rack level
> when lots of OSDs change at once, which I didn't realize already
> existed, but now I'm failing to find it in the tree (looking in
> OSDMonitor)...?

https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L3567-L3700

sage