To follow up on this issue, I saw the additional comments on
https://tracker.ceph.com/issues/59580 regarding mgr caps.
By setting the mgr user caps back to the default, I was able to reduce
the memory leak from several hundred MB/hr to just a few MB/hr.
As the other commenter had posted, in order for Zabbix to access OSD
data via RESTful, the mgr caps had been set to:

ceph auth caps mgr.controller04.lvhgea mon 'allow *' osd 'allow *' mds 'allow *'
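
For reference, putting the caps back is roughly the following (I'm assuming
the stock cephadm mgr caps of mon 'profile mgr', osd 'allow *', mds 'allow *'
here; 'ceph auth get' on a freshly deployed mgr key will show the exact
defaults):

# show what is currently set for this mgr key
ceph auth get mgr.controller04.lvhgea

# reset to the (assumed) cephadm defaults
ceph auth caps mgr.controller04.lvhgea mon 'profile mgr' osd 'allow *' mds 'allow *'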
Gary
On 2023-04-27 08:38, Gary Molenkamp wrote:
Good morning,
After upgrading from Octopus (15.2.17) to Pacific (16.2.12) two days
ago, I'm noticing that the MGR daemons keep failing over to the standby
and back every 24 hours. Watching the output of 'ceph orch ps', I can
see that the memory consumption of the mgr grows steadily until the
daemon becomes unresponsive.
When the mgr becomes unresponsive, tasks such as RESTful calls start
to fail, and the standby eventually takes over after ~20 minutes. I've
included a log of memory consumption (in 10-minute intervals) at the
end of this message. While the cluster recovers on its own, the loss of
usage data during the outage, and the fact that this is happening at
all, are problematic. Any assistance would be appreciated.
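
For anyone wanting to gather similar numbers, a simple loop like the
following is enough (just a sketch; the grep pattern and log file name
are arbitrary):

# append a snapshot of mgr memory use every 10 minutes
while true; do
    date
    ceph orch ps | grep '^mgr\.'
    sleep 600
done >> mgr-memory.log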
Note, this is a cluster that has been upgraded from an original
Jewel-based deployment using FileStore, through the BlueStore
conversion and the container conversion, and now to Pacific. The data
below shows memory use with three mgr modules enabled: cephadm,
restful, and iostat. By disabling iostat, I can reduce the rate of
memory growth to about 200MB/hr.
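
For reference, the iostat module can be toggled on and off while
testing with:

# list enabled and available mgr modules
ceph mgr module ls

# disable iostat and watch the effect on mgr memory growth
ceph mgr module disable iostat

# re-enable it afterwards
ceph mgr module enable iostat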
Thanks
Gary.
--
Gary Molenkamp              Science Technology Services
Systems Administrator       University of Western Ontario
molenkam@xxxxxx             http://sts.sci.uwo.ca
(519) 661-2111 x86882       (519) 661-3566