We have hit a memory leak in the Manager's RESTful interface on versions
17.2.5 and 17.2.6. On our main production cluster the active MGR grew to
about 60 GB until the oom_reaper killed it, at which point a standby MGR
took over and the killed daemon restarted. Since then we can see the
problem recurring, and in fact on all 3 of our clusters.
We have traced this to enabling full Ceph monitoring by Zabbix last
week. The leak is about 20 GB per day and seems to be proportional to
the number of PGs. For some time we had just the default settings and no
memory leak, but had not got around to finding out why many of the
Zabbix items were showing "Access Denied". We traced that to the MGR's
MON caps, which were "mon 'profile mgr'".
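For context, this is roughly how the caps appear on the daemon's key
(the entity name will differ per deployment, the osd/mds lines are just
the usual defaults, and the output is trimmed):

  ceph auth get mgr.host1
  [mgr.host1]
          key = <redacted>
          caps mds = "allow *"
          caps mon = "profile mgr"
          caps osd = "allow *"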
The MON logs showed this entry recurring:
log_channel(audit) log [DBG] : from='mgr.284576436 192.168.xxx.xxx:0/2356365' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: access denied
Changing the MGR caps to "mon 'allow *'" and restarting the MGR
immediately allowed "pg dump" to work, and all the follow-on REST calls
then succeeded:
log_channel(audit) log [DBG] : from='mgr.283590200 192.168.xxx.xxx:0/1779' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: dispatch
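For reference, the change was along these lines (the osd/mds caps were
left as they were, and the exact restart method will vary by
deployment):

  ceph auth caps mgr.host1 mon 'allow *' osd 'allow *' mds 'allow *'
  systemctl restart ceph-mgr@host1   # on the MGR host, so the daemon re-authenticates with the new caps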
However, it has also caused the memory leak to start.
We have reverted the caps and are back to where we were.
Two questions:
1) No matter what the REST consumer is doing, the MGR should not
accumulate memory, especially as we can see that the REST TCP
connections have been closed. Is there anything more we can do to
diagnose this? (One idea is sketched below.)
2) Setting "allow *" worked, but is there a better setting that just
allows the "pg dump" call (in addition to profile mgr)? (A guess at what
that might look like is also sketched below.)
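For (1), the only further idea we have so far is tcmalloc heap
profiling on the active MGR, assuming ceph-mgr honours the same "heap"
tell commands that are documented for the MON/OSD/MDS daemons, roughly:

  ceph tell mgr.host1 heap start_profiler   # start collecting heap profiles
  ceph tell mgr.host1 heap stats            # in-use vs freed-but-unreleased memory
  ceph tell mgr.host1 heap dump             # write a profile for offline inspection
  ceph tell mgr.host1 heap stop_profiler    # stop profiling when done

If there is a better way to see where the memory is going, we would be
glad to hear it.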
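For (2), what we had in mind is something along these lines, if the MON
cap grammar's "allow command" clause can be combined with a profile
(untested on our side; the entity name and the osd/mds caps are just
our current values):

  ceph auth caps mgr.host1 \
      mon 'profile mgr, allow command "pg dump"' \
      osd 'allow *' mds 'allow *'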
Thanks, Chris