On Thu, Mar 29, 2018 at 10:16 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Thu, Mar 29, 2018 at 3:25 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>> On Wed, Mar 28, 2018 at 11:44 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>> I had an amusing little problem today with a bug report about IO
>>> pausing on a cluster when OSDs are killed. Naturally, the first thing
>>> I wanted to do was see if it was the result of OSDs not getting marked
>>> down, or if the PGs were not peering quickly after that.
>>>
>>> Only it turns out that in Luminous, we no longer log the pg states to
>>> any single log I can find. ceph.log now contains only the health
>>> summary; I wasn't provided the mgr log but it appears to require debug
>>> 10 before printing out individual states.
>>
>> Let's change that to something like 4 instead of 10 so that it's at
>> least easier to get at them directly on the daemon?
>>
>>> This means the only way to
>>> get them is to have a high debug value while the logs are running (and
>>> I don't think this is something people are used to on the manager
>>> yet), and that any issues in the field will be difficult to resolve if
>>> they aren't immediately reproducible.
>>
>> The purist answer is that the PG states are included in the prometheus
>> output, which is a neater way of getting this kind of history of
>> quantitative things. However, I'm not a purist, so...
>
> Yeah, this is a neat solution but I think for real-world debugging we
> need a better transition from the current state of affairs.
>
> Is there any plausible way for picking up prometheus states via the
> existing ceph debugging tools, or for integrating that with the Ceph
> logging events? If not I think we need it in-situ. Plus, isn't the
> in-memory prometheus logging quite short?

Reinstating the logging (with the LogMonitor improvement to avoid
filling up a global buffer) is probably the only real "existing tools"
path here.
The prometheus module is just giving you a moment-in-time view, so it
doesn't do anything for you without an external thing querying it. Of
course, it's also very easy to write a few lines of python that hit
the prometheus endpoint and log something every N seconds, but at that
point I'd just install prometheus.

John

> -Greg
>
>>
>>> So: I'm pretty sure we need to log PG state changes in more detail by
>>> default. Does anybody have suggestions or preferences for *how* that
>>> happens? My preference is for them to show up in ceph.log...
>>
>> ... we could reinstate the PGMap spam at debug level in its own
>> channel in the cluster log, if we made LogMonitor keep separate
>> summary buffers for each channel. Currently it has one global buffer,
>> which means that any regular output (like the PGMap every 5 seconds)
>> will blow away the recent history of any other type of log message --
>> that was the motivation for eliminating the PGMap message rather than
>> just degrading it to debug.
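
For reference, the "few lines of python" polling idea mentioned in the reply could look roughly like the sketch below. It is only an illustration and rests on several assumptions that are not from this thread: that the mgr prometheus module is enabled and serving on localhost port 9283 at /metrics, and that the PG state gauges are named with a ceph_pg_ prefix (possibly one sample per pool). The function names, the 5-second interval, and the log format are made up for the example. It logs a line only when the state counts change, so the output stays short.

```python
import logging
import time
import urllib.request

# Assumptions (not from the thread): endpoint location and metric prefix.
METRICS_URL = "http://localhost:9283/metrics"
PREFIX = "ceph_pg_"
INTERVAL_SECONDS = 5  # the "N" in "every N seconds"


def parse_pg_states(text):
    """Return {state: count} from Prometheus text-format output.

    Samples that share a metric name (for example one per pool) are summed.
    """
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[:line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        if not name.startswith(PREFIX):
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore samples we cannot parse
        state = name[len(PREFIX):]
        states[state] = states.get(state, 0.0) + value
    return states


def format_states(states):
    """Render non-zero state counts as 'state=count' pairs, sorted by name."""
    return " ".join(
        f"{state}={int(count)}" for state, count in sorted(states.items()) if count
    )


def main():
    logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)
    previous = None
    while True:
        try:
            with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
                current = parse_pg_states(resp.read().decode("utf-8"))
        except OSError as exc:  # URLError and socket timeouts are OSErrors
            logging.warning("scrape failed: %s", exc)
        else:
            if current != previous:  # log only on change
                logging.info("pg states: %s", format_states(current))
                previous = current
        time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

As noted in the reply, this only records what it sees at each poll, so state changes shorter than the interval can be missed, and a real Prometheus install does the same job with proper history.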