On Mon, 5 Jun 2017, John Spray wrote:
> On Mon, Jun 5, 2017 at 10:21 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > I took a quick look at the get_health() methods in the Monitor after our
> > discussion this morning:
> >
> > - OSDMonitor::get_health() looks at the pool stats for a few things; I
> >   think these can be safely/easily moved to PGMap::get_health() (so that
> >   they will run in ceph-mgr)
> > - Then it'll be an easy change to calculate the health and detail sets in
> >   encode_pending as each OSDMap is published.
> > - MgrStatMonitor is already persisting the mgr health messages.
> > - MDSMonitor is also strictly a function of the FSMap so it'd be easy to
> >   move to encode_pending.
> > - Monitor::get_health() has some odds and ends we can either leave in
> >   place or improve (e.g., time skew checks).  Not sure it matters much.
> >
> > My main question is whether you had specific thoughts about how to
> > identify warnings so that we can note when they appear and disappear.  We
> > can just go by the unique strings but then you'll see something like
> >
> >   1 osd(s) down
> >   ...
> >   1 osd(s) down cleared
> >   2 osd(s) down
> >   ...
> >
> > (or whatever we make the messages for cleared warnings look like).  Should
> > we associate a 'tag' for each message that is used to identify it, so
> > that, for example, "%d osd down" for any number of OSDs is considered the
> > "same" message and we log when it changes but don't say it has cleared?
>
> Yes, exactly.
>
> All the possible health warnings should get a unique error
> code (tag, if you like), that would be a stable thing that we explain
> in the docs, like we do for the MDS health messages[1].  Adding the
> codes for health messages generally was one of the steps on my
> favorite tracker ticket[2] (it has aged like a fine wine).
>
> We'll need to look at the interplay between this and other logging --
> in some cases, if we're e.g. already logging nice messages for OSDs
> going up and down, then we might not want to also have log messages
> redundantly printing the health state.  We might also want to get rid
> of the places that we echo the map summary on changes like this, or at
> least put them at a lower severity than what the operator sees by
> default.  Basically, when an OSD goes down, we should make sure there
> is one log message to that effect, rather than 2 or 3.
>
> BTW, earlier we were talking about logging things at a host/rack level
> when lots of OSDs change at once, which I didn't realize already
> existed, but now I'm failing to find it in the tree (looking in
> OSDMonitor)...?

https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L3567-L3700

sage
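
A minimal sketch of the tagging idea discussed above, in Python for brevity (the monitor code itself is C++). It assumes each health warning is keyed by a stable code and only the summary text varies, so a change from "1 osd(s) down" to "2 osd(s) down" is logged as an update rather than a clear followed by a new warning. The function name diff_health, the OSD_DOWN code, and the log wording are all hypothetical and are not taken from the tree:

```python
from typing import Dict, List


def diff_health(prev: Dict[str, str], curr: Dict[str, str]) -> List[str]:
    """Compare two health snapshots keyed by a stable code.

    Both arguments map a (hypothetical) health code such as "OSD_DOWN"
    to its current summary string. Returns the log lines to emit.
    """
    lines = []
    for code in sorted(curr):
        if code not in prev:
            # The code was absent before: the warning has appeared.
            lines.append(f"Health check failed: {curr[code]} ({code})")
        elif prev[code] != curr[code]:
            # Same code, different text: log the change, not a clear.
            lines.append(f"Health check update: {curr[code]} ({code})")
    for code in sorted(prev):
        if code not in curr:
            # The code is gone entirely: only now is it "cleared".
            lines.append(f"Health check cleared: {code} (was: {prev[code]})")
    return lines


if __name__ == "__main__":
    # Replay the sequence from the example in the quoted message.
    states = [
        {},
        {"OSD_DOWN": "1 osd(s) down"},
        {"OSD_DOWN": "2 osd(s) down"},
        {},
    ]
    for before, after in zip(states, states[1:]):
        for line in diff_health(before, after):
            print(line)
```

Keying on the code rather than the rendered string is what keeps the count change from showing up as a spurious "cleared" line; whether the update itself should be logged at all is a separate policy choice.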