[Nautilus 14.2.10] MGR high load issues

"Johannes L" <johannes.liebl@xxxxxxxx> · Thu, 09 Jul 2020 07:55:24 -0000

Hello Ceph-Devs,

We have noticed a rise in overall load from the MGR daemon after upgrading to Nautilus 14.2.9 from Luminous 12.2.13. As a result, the Prometheus module can become unable to respond due to overload, for example while an OSD is out. We evaluated this on our test clusters with recent hardware; the issues persisted and even got worse, with gaps in the Prometheus metric collection while the cluster was being written to in a perfectly healthy state.

After some digging, and after hoping that the pull request for https://tracker.ceph.com/issues/45439 (https://github.com/ceph/ceph/pull/34356) would alleviate the issue, which it didn't, we traced most of our troubles down to the Progress MGR module:

The notify function in the progress module is highly inefficient in its current form because it collects PG data even when nothing is being done with it (self._events being empty).
This regularly blocks the Prometheus module, so it does not respond in time (response times of > 10 seconds, or even outright cherrypy timeouts).
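
To illustrate the idea, here is a minimal, self-contained sketch of such an early-return guard. It is not the code from the pull request and does not use the real ceph-mgr API: the class ProgressSketch, the fetch_pg_stats callable, and the fetches counter are hypothetical stand-ins; only self._events and the notify function name come from the description above.

```python
class ProgressSketch:
    """Hypothetical stand-in for an MGR module (not the real ceph-mgr API)."""

    def __init__(self, fetch_pg_stats):
        self._events = {}                       # event id -> event state
        self._fetch_pg_stats = fetch_pg_stats   # assumed expensive PG data call
        self.fetches = 0                        # counts expensive calls, for illustration

    def notify(self, notify_type, notify_id):
        if notify_type != "pg_summary":
            return
        # Guard: with no tracked events there is nothing to update,
        # so skip the expensive PG data collection entirely.
        if not self._events:
            return
        self.fetches += 1
        pg_stats = self._fetch_pg_stats()
        for ev in list(self._events.values()):
            ev["last_seen_pgs"] = len(pg_stats)
```

With the guard in place, notifications that arrive while no event is being tracked return immediately, so the expensive call is only made when there is something to update.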

We have prepared an issue ticket and a pull request to fix this:

https://tracker.ceph.com/issues/46416
https://github.com/ceph/ceph/pull/35973

After implementing this easy fix we haven't experienced any Prometheus timeouts.

Could someone please review, merge, and backport this pull request?

Thanks in advance
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx