High CPU usage by ceph-mgr in 14.2.6

jbardgett@xxxxxxxxxxx · Wed, 29 Jan 2020 00:19:40 -0000

After upgrading one of our clusters from Luminous 12.2.12 to Nautilus 14.2.6, I am seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). We noticed the problem because Prometheus could no longer report certain metrics, specifically OSD usage and OSD apply and commit latency. These symptoms are similar to issues people reported with earlier Nautilus releases.
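
For anyone who wants to check their own mgr for a runaway thread without watching 'top -H' interactively, here is a minimal sketch that reads the per-thread counters from /proc on Linux. The function names and the script name are made up for illustration. It reports CPU ticks accumulated since each thread started, not the instantaneous percentage that top shows, so treat it as a rough cross-check only.

```python
import os
import sys


def parse_task_stat(stat_line):
    """Return (tid, thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, tail = stat_line.rpartition(")")
    tid_text, _, name = head.partition("(")
    fields = tail.split()
    # Per proc(5), utime and stime are fields 14 and 15; 'fields' starts at field 3.
    utime, stime = int(fields[11]), int(fields[12])
    return int(tid_text), name, utime + stime


def busiest_threads(pid, limit=5):
    """List the threads of `pid` with the most accumulated CPU time."""
    task_dir = f"/proc/{pid}/task"
    results = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                results.append(parse_task_stat(handle.read()))
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited while we were scanning
    return sorted(results, key=lambda item: item[2], reverse=True)[:limit]


if __name__ == "__main__":
    # Hypothetical usage: python3 busiest_threads.py <ceph-mgr pid>
    for tid, name, ticks in busiest_threads(int(sys.argv[1])):
        print(f"{tid:>8}  {name:<16}  {ticks} ticks")
```

Running it twice a few seconds apart and comparing the tick counts gives a rough idea of which thread is burning CPU right now.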

Bryan Stillwell previously reported this on a separate cluster of ours running 14.2.5:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/VW3GNVJGOOWA5RMUULRMZCQL5OEY44N7/#6QNDSLMHDVN7AZ3T6OPGU3YOJYAVUAEY

That issue was resolved with the upgrade to 14.2.6.

We are seeing a similar issue on this other cluster, with a couple of differences:

This cluster has 1900+ OSDs; the previous one had 300+.
The top CPU consumer is libceph-common, instead of mmap.

4.86%  libceph-common.so.0  [.] EventCenter::create_time_event
2.78%  [kernel]             [k] nmi
2.64%  libstdc++.so.6.0.19  [.] __dynamic_cast

None of our other clusters that have been upgraded to 14.2.6 show this issue; the next largest has 800+ OSDs.

We suspect this is related to the size of the cluster, as in the previous report.

Is anyone else experiencing this, or can anyone provide some direction on how to resolve it?

Thanks,
Joe