On Mon, Aug 06, 2018 at 05:24:07PM +0100, John Spray wrote:
On Mon, Aug 6, 2018 at 5:04 PM Wido den Hollander <wido@xxxxxxxx> wrote:
Hi,
I'm busy with a customer trying to speed up the Influx and Telegraf
module to gather statistics of their cluster with 2,000 OSDs.
The problem I'm running into is the performance of the Influx module,
but this seems to boil down to the Mgr daemon.
Gathering and sending all statistics of the cluster takes about 35
seconds with the current code of the Influx module.
By using iterators, queues and multi-threading I was able to bring this
down to ~20 seconds, but the main problem is this piece of code:
    for daemon, counters in six.iteritems(self.get_all_perf_counters()):
        svc_type, svc_id = daemon.split(".", 1)
        metadata = self.get_metadata(svc_type, svc_id)

        for path, counter_info in counters.items():
            if counter_info['type'] & self.PERFCOUNTER_HISTOGRAM:
                continue
Gathering all the performance counters and metadata of these 2,000
daemons brings the grand total to about 95k data points.
Influx flushes this within just a few seconds, but it takes the Mgr
daemon a lot more time to spit them out.
I also see that the ceph-mgr daemon starts to use a lot of CPU when
going through this.
The Telegraf module also suffers from this as it uses the same code path
to fetch these counters.
Is there anything we can do better inside the modules? Or something to
be improved inside the Mgr?
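For reference, the iterator, queue and multi-threading approach mentioned above could look roughly like the sketch below. It is not the actual patch to the Influx module. The names iter_points and send_all, the send_batch callback, the 'value' key and the flag value 0x10 are assumptions made for illustration only.

```python
import queue
import threading

PERFCOUNTER_HISTOGRAM = 0x10  # assumed flag value, for illustration only


def iter_points(all_counters):
    """Yield one data point at a time instead of building a full list."""
    for daemon, counters in all_counters.items():
        svc_type, svc_id = daemon.split(".", 1)
        for path, info in counters.items():
            if info["type"] & PERFCOUNTER_HISTOGRAM:
                continue  # histograms are skipped, as in the loop above
            yield {"type": svc_type, "id": svc_id,
                   "path": path, "value": info["value"]}


def send_all(points, send_batch, workers=4, batch_size=1000):
    """Fan batches of points out to worker threads; return any errors."""
    q = queue.Queue()
    errors = []

    def worker():
        while True:
            batch = q.get()
            if batch is None:  # sentinel: no more work for this thread
                break
            try:
                send_batch(batch)
            except Exception as exc:  # keep the worker alive, report later
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    batch = []
    for point in points:
        batch.append(point)
        if len(batch) >= batch_size:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)  # flush the final, partially filled batch

    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return errors
```

Threads of this kind only overlap the network writes. The time spent producing the counters inside the Mgr is not reduced by them.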
There's definitely room to make get_all_perf_counters *much* more
efficient. It's currently issuing individual get_counter() calls into
C++ land for every counter, and get_counter is returning the last N
values into python before get_latest throws away all but the latest.
I'd suggest implementing a C++ version of get_all_perf_counters.
There will always be some ceiling on how much data is practical in the
"one big endpoint" approach to gathering stats, but if we have a
potential order of magnitude improvement in this call then we should
do it.
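To illustrate the difference, here is a small pure-Python model of the two access patterns. It is only a sketch: FakeCounterStore stands in for the C++ side and is not the real ceph-mgr interface, get_all_latest is a hypothetical bulk call, and HISTORY is an assumed value for N.

```python
from collections import deque

HISTORY = 20  # assumed number of samples kept per counter


class FakeCounterStore:
    """Stand-in for the C++ side; not the real ceph-mgr interface."""

    def __init__(self):
        self.data = {}   # (daemon, path) -> deque of (timestamp, value)
        self.calls = 0   # number of simulated Python-to-C++ crossings

    def record(self, daemon, path, ts, value):
        key = (daemon, path)
        self.data.setdefault(key, deque(maxlen=HISTORY)).append((ts, value))

    def get_counter(self, daemon, path):
        # Per-counter call: copies the whole retained history.
        self.calls += 1
        return list(self.data[(daemon, path)])

    def get_all_latest(self):
        # Hypothetical bulk call: one crossing, newest sample only.
        self.calls += 1
        out = {}
        for (daemon, path), samples in self.data.items():
            out.setdefault(daemon, {})[path] = samples[-1][1]
        return out


def latest_per_counter(store):
    """Current pattern: one call per counter, then drop all but the last."""
    out = {}
    for daemon, path in list(store.data):
        samples = store.get_counter(daemon, path)
        out.setdefault(daemon, {})[path] = samples[-1][1]
    return out
```

Both paths return the same mapping. The per-counter path makes one call per counter and copies up to HISTORY samples each time, while the bulk path makes a single call and copies one value per counter.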
Fwiw I poked at exactly that.
https://github.com/jan--f/ceph/commit/261bf054c94b4f4d06cb6baef0d4d49dd90795bf
So far the code is quite a bit slower though, likely because it is a quick
hack and my C++-fu is lacking. I'm just back in the office though and plan to
pick this up again.
John
Wido
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)