Performance of Ceph Mgr and fetching daemon counters

Wido den Hollander <wido@xxxxxxxx> · Mon, 6 Aug 2018 18:04:05 +0200

Hi,

I'm working with a customer to speed up the Influx and Telegraf
modules, which gather statistics from their cluster of 2,000 OSDs.

The problem I'm running into is the performance of the Influx module,
but this seems to boil down to the Mgr daemon.

Gathering and sending all statistics of the cluster takes about 35
seconds with the current code of the Influx module.

By using iterators, queues and multi-threading I was able to bring this
down to ~20 seconds, but the main problem is this piece of code:

    for daemon, counters in six.iteritems(self.get_all_perf_counters()):
        svc_type, svc_id = daemon.split(".", 1)
        metadata = self.get_metadata(svc_type, svc_id)

        for path, counter_info in counters.items():
            if counter_info['type'] & self.PERFCOUNTER_HISTOGRAM:
                continue

Gathering all the performance counters and metadata of these 2,000
daemons brings the grand total to about 95k data points.
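
For illustration, the iterator/queue/thread approach mentioned above
could look roughly like the sketch below. This is not the actual module
code: the names iter_points, worker, gather_and_send and send_batch, the
measurement name, the sample layout of the counters and the value of
PERFCOUNTER_HISTOGRAM are all assumptions made for a standalone example,
and metadata comes from a plain dict instead of self.get_metadata().

```python
import queue
import threading

PERFCOUNTER_HISTOGRAM = 0x10  # assumed flag value for this sketch


def iter_points(all_counters, metadata_by_daemon):
    """Lazily yield one data point per non-histogram counter."""
    for daemon, counters in all_counters.items():
        svc_type, svc_id = daemon.split(".", 1)
        # Look up metadata once per daemon, not once per counter.
        metadata = metadata_by_daemon.get(daemon, {})
        for path, info in counters.items():
            if info["type"] & PERFCOUNTER_HISTOGRAM:
                continue
            yield {
                "measurement": "ceph_daemon_stats",  # hypothetical name
                "tags": {
                    "svc_type": svc_type,
                    "svc_id": svc_id,
                    "counter": path,
                    "host": metadata.get("hostname", "unknown"),
                },
                "value": info["value"],
            }


def worker(q, send_batch, batch_size):
    """Take points off the queue and pass them on in batches."""
    batch = []
    while True:
        point = q.get()
        if point is None:  # sentinel: stop after flushing
            break
        batch.append(point)
        if len(batch) >= batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)


def gather_and_send(all_counters, metadata_by_daemon, send_batch,
                    workers=4, batch_size=5000):
    """Feed points from the generator to a pool of sender threads."""
    q = queue.Queue(maxsize=10 * batch_size)
    threads = [
        threading.Thread(target=worker, args=(q, send_batch, batch_size))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for point in iter_points(all_counters, metadata_by_daemon):
        q.put(point)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
```

Note that in CPython the threads mainly help by overlapping the network
I/O in send_batch; building the points still runs under the GIL, so
this does not make the counter fetch itself any cheaper.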

Influx flushes this within just a few seconds, but it takes the Mgr
daemon a lot more time to spit them out.

I also see that the ceph-mgr daemon starts to use a lot of CPU while
going through this.
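
To narrow down which step costs the time, the phases could be timed
separately, along the lines of the sketch below. The helpers timed and
profile_gather are hypothetical, and the two callables passed in stand
in for self.get_all_perf_counters and self.get_metadata; none of this
is existing Mgr code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def profile_gather(get_all_perf_counters, get_metadata):
    """Time the counter fetch, the metadata lookups and the iteration."""
    timings = {}
    with timed("fetch_counters", timings):
        all_counters = get_all_perf_counters()
    with timed("fetch_metadata", timings):
        for daemon in all_counters:
            svc_type, svc_id = daemon.split(".", 1)
            get_metadata(svc_type, svc_id)
    with timed("iterate", timings):
        count = sum(len(counters) for counters in all_counters.values())
    return count, timings
```

Called from inside a module as profile_gather(self.get_all_perf_counters,
self.get_metadata), this would return the number of counters seen plus a
dict of seconds spent per phase.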

The Telegraf module also suffers from this as it uses the same code path
to fetch these counters.

Is there anything we can do better inside the modules? Or something to
be improved inside the Mgr?

Wido