health checks and logging

Sage Weil <sweil@xxxxxxxxxx> · Mon, 5 Jun 2017 21:21:36 +0000 (UTC)

I took a quick look at the get_health() methods in the Monitor after our 
discussion this morning:

- OSDMonitor::get_health() looks at the pool stats for a few things; I 
think these can be safely/easily moved to PGMap::get_health() (so that 
they will run in ceph-mgr)
- Then it'll be an easy change to calculate the health and detail sets in 
encode_pending as each OSDMap is published.
- MgrStatMonitor is already persisting the mgr health messages.
- MDSMonitor is also strictly a function of the FSMap so it'd be easy to 
move to encode_pending.
- Monitor::get_health() has some odds and ends we can either leave in 
place or improve (e.g., time skew checks).  Not sure it matters much.
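
As a rough illustration of the idea in the list above (Python for brevity, not 
Ceph's C++): if health is strictly a function of the map, it can be computed 
once when the map is published and stored next to it. Every name below 
(OSDMapSketch, PendingStore, compute_health) is a hypothetical stand-in, not 
the real Monitor code.

```python
from dataclasses import dataclass, field


@dataclass
class OSDMapSketch:
    """Hypothetical stand-in for an OSDMap: an epoch and per-OSD up flags."""
    epoch: int
    osd_up: dict  # osd id -> True if the OSD is up


def compute_health(osdmap):
    """Pure function of the map: returns (summary, detail) lists of strings."""
    down = sorted(o for o, up in osdmap.osd_up.items() if not up)
    summary, detail = [], []
    if down:
        summary.append(f"{len(down)} osd(s) down")
        detail.extend(f"osd.{o} is down" for o in down)
    return summary, detail


@dataclass
class PendingStore:
    """Hypothetical store that keeps health alongside each published map."""
    health_by_epoch: dict = field(default_factory=dict)

    def encode_pending(self, osdmap):
        # Health is computed once at publish time, not on every health query.
        self.health_by_epoch[osdmap.epoch] = compute_health(osdmap)


store = PendingStore()
store.encode_pending(OSDMapSketch(epoch=7, osd_up={0: True, 1: False, 2: True}))
print(store.health_by_epoch[7])
```

Because compute_health() only reads the map, the same result could be 
recomputed anywhere the map is available.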

My main question is whether you had specific thoughts about how to 
identify warnings so that we can note when they appear and disappear.  We 
can just go by the unique strings but then you'll see something like

 1 osd(s) down
 ...
 1 osd(s) down cleared
 2 osd(s) down
 ...

(or whatever we make the messages for cleared warnings look like).  Should 
we associate a 'tag' with each message that is used to identify it, so 
that, for example, "%d osd down" for any number of OSDs is considered the 
"same" message and we log when it changes but don't say it has cleared?
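
To make the tag idea concrete, here is a minimal Python sketch (again not Ceph 
code). It assumes each health message is keyed by a tag; the tag name 
"OSD_DOWN" and the wording of the log lines are made up for illustration. A 
message whose tag persists is logged as changed, and "cleared" is logged only 
when the tag disappears.

```python
def diff_health(prev, curr):
    """Compare two {tag: message} dicts and return the lines to log."""
    lines = []
    for tag, msg in curr.items():
        if tag not in prev:
            lines.append(msg)                          # warning appeared
        elif prev[tag] != msg:
            lines.append(f"{msg} (was: {prev[tag]})")  # same tag, new text
    for tag, msg in prev.items():
        if tag not in curr:
            lines.append(f"{msg} cleared")             # warning went away
    return lines


# Hypothetical sequence of health states, keyed by a made-up tag.
states = [
    {"OSD_DOWN": "1 osd(s) down"},
    {"OSD_DOWN": "2 osd(s) down"},
    {},
]

log = []
prev = {}
for curr in states:
    log.extend(diff_health(prev, curr))
    prev = curr

for line in log:
    print(line)
```

With string-only matching the middle step would show up as one warning 
clearing and another appearing; keyed by tag it is a single change.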

sage