On Fri, May 12, 2017 at 11:07 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Fri, 12 May 2017, Brad Hubbard wrote:
>> On Thu, May 11, 2017 at 10:47 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Thu, 11 May 2017, John Spray wrote:
>> >> On Thu, May 11, 2017 at 12:52 PM, Jan Fajerski <jfajerski@xxxxxxxx> wrote:
>> >> > Hi list,
>> >> > I recently looked into Ceph monitoring with prometheus. There is already a
>> >> > ceph exporter for this purpose here:
>> >> > https://github.com/digitalocean/ceph_exporter.
>> >> >
>> >> > Prometheus encourages software projects to instrument their code directly
>> >> > and expose this data, instead of using an external piece of code. Several
>> >> > libraries are provided for this purpose:
>> >> > https://prometheus.io/docs/instrumenting/clientlibs/
>> >> >
>> >> > I think there are arguments for adding this instrumentation to Ceph
>> >> > directly. Generally speaking it should reduce overall complexity, both in
>> >> > the code (no extra exporter component outside of ceph) and in operations
>> >> > (no extra package and configuration).
>> >> >
>> >> > The direct instrumentation could happen in two places:
>> >> > 1)
>> >> > Directly in Ceph's C++ code using https://github.com/jupp0r/prometheus-cpp.
>> >> > This would mean daemons expose their metrics directly via the prometheus
>> >> > http interface. This would be the most direct way of exposing metrics;
>> >> > prometheus would simply poll all endpoints. Service discovery for scrape
>> >> > targets, say added or removed OSDs, would however have to be handled
>> >> > somewhere. Orchestration tools à la k8s, ansible, salt, ... either have
>> >> > this feature already or it would be simple enough to add. Deployments not
>> >> > using a tool like that need another approach. Prometheus offers various
>> >> > mechanisms
>> >> > (https://prometheus.io/docs/operating/configuration/#%3Cscrape_config%3E),
>> >> > or a ceph component (say mon or mgr) could handle this.
>> >> >
>> >> > 2)
>> >> > Add a ceph-mgr plugin that exposes the metrics available to ceph-mgr as a
>> >> > prometheus scrape target (using
>> >> > https://github.com/prometheus/client_python). This would handle the service
>> >> > discovery issue for ceph daemons out of the box (though not for the actual
>> >> > mgr daemon, which is the scrape target). The code would also be in a central
>> >> > location instead of being scattered in several places. It does however add a
>> >> > (maybe pointless) level of indirection ($ceph_daemon -> ceph-mgr ->
>> >> > prometheus) and adds the need for two different scrape intervals (assuming
>> >> > mgr polls metrics from daemons).
>> >>
>> >> I would love to see a mgr module for prometheus integration!
>> >
>> > Me too! It might make more sense to do it in C++ than python, though, for
>> > performance reasons.
>>
>> Can we define "metrics" here? What, specifically, are we planning to gather?
>>
>> Let's start with an example from "ceph_exporter". It exposes a metric
>> ApplyLatency, which it obtains by connecting to the cluster via a rados client
>> connection, running the "osd perf" command, and gathering the apply_latency_ms
>> result. I believe this stat is the equivalent of the apply_latency perf
>> counter statistic.
>>
>> Does the manager currently export the performance counters? If not, option 1
>> looks more viable for gathering these sorts of metrics (think "perf dump"),
>> unless the manager can proxy calls such as "osd perf" back to the MONs?
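(As an aside, Jan's option 2 would presumably boil down to something like
the rough, untested sketch below: prometheus's Python client serving a
scrape endpoint and re-publishing perf counter values pulled from the mgr.
The get_all_perf_counters() helper, the port, and the 15s interval are
made-up placeholders for whatever interface the mgr ends up giving
plugins; only the prometheus_client calls are the real library API.)

    import time
    from prometheus_client import Gauge, start_http_server

    # One gauge per exported counter, labelled by daemon so each OSD gets
    # its own time series.
    osd_apply_latency = Gauge(
        'ceph_osd_apply_latency_ms',
        'OSD apply latency in milliseconds',
        ['ceph_daemon'])

    def refresh(get_all_perf_counters):
        # Hypothetical data source, e.g.
        # {'osd.0': {'apply_latency_ms': 12.0, ...}, ...} -- in reality this
        # would come from the perf counters the daemons report to ceph-mgr.
        for daemon, counters in get_all_perf_counters().items():
            if 'apply_latency_ms' in counters:
                osd_apply_latency.labels(ceph_daemon=daemon).set(
                    counters['apply_latency_ms'])

    if __name__ == '__main__':
        start_http_server(9283)     # arbitrary port for the scrape target
        while True:
            refresh(lambda: {})     # wire in the real mgr-side source here
            time.sleep(15)          # the second scrape interval Jan mentions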
>
> Right now all of the perfcounters are reported to ceph-mgr. We shouldn't
> need to do 'osd perf' (which is just reporting those 2 metrics that the
> osds have historically reported to the mon).

Ah, in DaemonState.* and MgrClient.cc. I see the mechanics now, thanks.

>
>> Part of the problem with gathering metrics from ceph is working out what set of
>> metrics you want to collect from a large assortment available IMHO.
>
> We could collect them all. Or, we recently introduced a 'priority' field
> so we can collect everything above a threshold (although then we have to
> go assign meaningful priorities to most of the counters).
>
> BTW one of the cool things about prometheus is that it has a histogram
> type, which means we can take our 2d histogram data and report that
> (flattened into one or the other dimension).
>
> sage

--
Cheers,
Brad
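P.S. To make the histogram point concrete: a rough, untested sketch of
flattening a 2d histogram onto one axis and exporting it via a custom
collector. The get_2d_histogram() source and the latency bucket bounds are
made-up placeholders; only the prometheus_client classes are the real
library API.

    from prometheus_client.core import HistogramMetricFamily, REGISTRY

    class FlattenedOpLatencyCollector(object):
        def __init__(self, get_2d_histogram, latency_bounds):
            # get_2d_histogram() -> one row per size bucket, each row a list
            # of counts per latency bucket (hypothetical shape).
            self._get_hist = get_2d_histogram
            self._bounds = latency_bounds  # upper bounds of the latency axis

        def collect(self):
            hist2d = self._get_hist()
            # Flatten onto the latency axis by summing over the size axis.
            flat = [sum(col) for col in zip(*hist2d)]
            cumulative, buckets = 0, []
            for bound, count in zip(self._bounds, flat):
                cumulative += count
                buckets.append((str(bound), cumulative))
            buckets.append(('+Inf', cumulative))
            metric = HistogramMetricFamily(
                'ceph_osd_op_latency_seconds',
                'OSD op latency, 2d histogram flattened onto the latency axis')
            # Pre-bucketed data carries no exact sum; 0 is a placeholder.
            metric.add_metric([], buckets, 0)
            yield metric

    # REGISTRY.register(FlattenedOpLatencyCollector(source, [0.001, 0.01, 0.1, 1.0]))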