On Tue, Jun 6, 2017 at 10:29 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Mon, 5 Jun 2017, John Spray wrote:
>> On Mon, Jun 5, 2017 at 10:21 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > I took a quick look at the get_health() methods in the Monitor after our
>> > discussion this morning:
>> >
>> > - OSDMonitor::get_health() looks at the pool stats for a few things; I
>> >   think these can be safely/easily moved to PGMap::get_health() (so that
>> >   they will run in ceph-mgr)
>> > - Then it'll be an easy change to calculate the health and detail sets in
>> >   encode_pending as each OSDMap is published.
>> > - MgrStatMonitor is already persisting the mgr health messages.
>> > - MDSMonitor is also strictly a function of the FSMap so it'd be easy to
>> >   move to encode_pending.
>> > - Monitor::get_health() has some odds and ends we can either leave in
>> >   place or improve (e.g., time skew checks).  Not sure it matters much.
>> >
>> > My main question is whether you had specific thoughts about how to
>> > identify warnings so that we can note when they appear and disappear.  We
>> > can just go by the unique strings but then you'll see something like
>> >
>> >   1 osd(s) down
>> >   ...
>> >   1 osd(s) down cleared
>> >   2 osd(s) down
>> >   ...
>> >
>> > (or whatever we make the messages for cleared warnings look like).  Should
>> > we associate a 'tag' for each message that is used to identify it, so
>> > that, for example, "%d osd down" for any number of OSDs is considered the
>> > "same" message and we log when it changes but don't say it has cleared?
>>
>> Yes, exactly.
>>
>> All the possible health messages and warnings should get a unique error
>> code (tag, if you like), that would be a stable thing that we explain
>> in the docs, like we do for the MDS health messages[1].  Adding the
>> codes for health messages generally was one of the steps on my
>> favorite tracker ticket[2] (it has aged like a fine wine).
>>
>> We'll need to look at the interplay between this and other logging --
>> in some cases, if we're e.g. already logging nice messages for OSDs
>> going up and down, then we might not want to also have log messages
>> redundantly printing the health state.  We might also want to get rid
>> of the places that we echo the map summary on changes like this, or at
>> least put them at a lower severity than what the operator sees by
>> default.  Basically, when an OSD goes down, we should make sure there
>> is one log message to that effect, rather than 2 or 3.
>>
>> BTW, earlier we were talking about logging things at a host/rack level
>> when lots of OSDs change at once, which I didn't realize already
>> existed, but now I'm failing to find it in the tree (looking in
>> OSDMonitor)...?
>
> https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L3567-L3700

Thanks -- I think when we were talking about it before I was confused
between the logging and the health bits.

John

> sage
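
For illustration only, the tagging idea discussed above could be sketched
roughly as follows. This is not code from the Ceph tree: the function name
diff_health, the code name "OSD_DOWN", and the log line wording are all
hypothetical, chosen just to show how keying on a stable code (rather than
on the message string) turns "1 osd(s) down" -> "2 osd(s) down" into a
single update instead of a "cleared" followed by a new warning.

```python
from typing import Dict, List


def diff_health(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return log lines describing how the set of health checks changed.

    Both arguments map a stable check code (hypothetical names such as
    "OSD_DOWN") to the human-readable summary currently shown for it.
    """
    lines = []
    # Codes present now: either newly raised, or the same check with new text.
    for code in sorted(current):
        summary = current[code]
        if code not in previous:
            lines.append(f"health check failed: {code}: {summary}")
        elif previous[code] != summary:
            lines.append(f"health check update: {code}: {summary}")
    # Codes that were present before and are gone now have cleared.
    for code in sorted(previous):
        if code not in current:
            lines.append(f"health check cleared: {code} (was: {previous[code]})")
    return lines


if __name__ == "__main__":
    # Hypothetical sequence of health states, one per published map.
    states = [
        {},
        {"OSD_DOWN": "1 osd(s) down"},
        {"OSD_DOWN": "2 osd(s) down"},
        {},
    ]
    for before, after in zip(states, states[1:]):
        for line in diff_health(before, after):
            print(line)
```

Because the comparison is per code, an unchanged summary produces no log
line at all, which also fits the goal of having one message per event
rather than 2 or 3.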