Hi,

I do not believe this is actively being worked on, but there is a
tracker open; if you can submit an update there it may help attract
attention and lead to a fix: https://tracker.ceph.com/issues/59580

David

On Fri, Sep 8, 2023, at 03:29, Chris Palmer wrote:
> I first posted this on 17 April but did not get any response
> (although IIRC a number of other posts referred to it). Seeing as
> MGR OOM is being discussed at the moment, I am re-posting. These
> clusters are not containerized.
>
> Is this being tracked/fixed or not?
>
> Thanks, Chris
>
> -------------------------------
>
> We've hit a memory leak in the Manager RESTful interface in versions
> 17.2.5 and 17.2.6. On our main production cluster the active MGR grew
> to about 60 GB until the oom_reaper killed it, causing a successful
> failover and a restart of the failed daemon. We can now see that the
> problem recurs on all three of our clusters.
>
> We've traced this to enabling full Ceph monitoring by Zabbix last
> week. The leak is about 20 GB per day, and seems to be proportional
> to the number of PGs. For some time we just had the default settings,
> and no memory leak, but had not got around to finding out why many of
> the Zabbix items were showing as "access denied". We traced this to
> the MGR's MON caps, which were "mon 'profile mgr'".
>
> The MON logs showed recurring:
>
>     log_channel(audit) log [DBG] : from='mgr.284576436
>     192.168.xxx.xxx:0/2356365' entity='mgr.host1' cmd=[{"format":
>     "json", "prefix": "pg dump"}]: access denied
>
> Changing the MGR caps to "mon 'allow *'" and restarting the MGR
> immediately allowed that command to work, and all the follow-on REST
> calls succeeded:
>
>     log_channel(audit) log [DBG] : from='mgr.283590200
>     192.168.xxx.xxx:0/1779' entity='mgr.host1' cmd=[{"format":
>     "json", "prefix": "pg dump"}]: dispatch
>
> However, it also caused the memory leak to start.
>
> We've reverted the caps and are back to how we were.
>
> Two questions:
>
> 1) No matter what the REST consumer is doing, the MGR should not
>    accumulate memory, especially as we can see that the REST TCP
>    connections have been closed. Is there anything more we can do to
>    diagnose this?
>
> 2) Setting "allow *" worked, but is there a better setting that
>    allows just the "pg dump" call (in addition to profile mgr)?
>
> Thanks, Chris
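On question 1, a first step before updating the tracker could be to
sample the active MGR's resident set size and correlate the growth
with the Zabbix polling interval. A minimal sketch, assuming a
non-containerized host (as stated above) where ceph-mgr is visible to
ps; the log path is illustrative only:

    # Sample the active ceph-mgr's RSS (in KiB) once a minute so that
    # growth can be correlated with REST polling activity.
    while true; do
        printf '%s %s\n' "$(date -Is)" "$(ps -o rss= -C ceph-mgr)"
        sleep 60
    done >> /tmp/ceph-mgr-rss.log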
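If the ceph-mgr build links tcmalloc, the generic heap introspection
that Ceph daemons expose might also help narrow down where the memory
sits. Whether the 17.2.x mgr accepts these tell commands is an
assumption worth verifying first, not something confirmed in this
thread:

    # tcmalloc heap introspection via the tell interface (assumed to
    # be supported by the mgr here; verify on your release first).
    ceph tell mgr heap stats
    ceph tell mgr heap start_profiler
    # ...reproduce the leak for a while, then:
    ceph tell mgr heap dump
    ceph tell mgr heap stop_profiler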
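On question 2, MON caps can grant individual commands by name, so in
principle the blanket "allow *" could be narrowed to the one denied
command. An untested sketch, keeping the standard mgr profile and
adding only "pg dump"; the entity name mgr.host1 is taken from the
log excerpts above:

    # Untested: standard mgr profile plus just the "pg dump" command,
    # rather than blanket 'allow *'. Substitute your own entity name.
    ceph auth caps mgr.host1 \
        mon 'profile mgr, allow command "pg dump"' \
        osd 'allow *' mds 'allow *'

This is the generic allow-command cap syntax rather than something
confirmed in the thread, so it would need checking against the same
audit-log test shown above.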