We have hit a memory leak in the Manager's RESTful interface on versions
17.2.5 and 17.2.6. On our main production cluster the active MGR grew to
about 60 GB until the oom_reaper killed it, at which point a standby MGR
took over and the killed daemon restarted. Since then we can see the
problem recurring, and in fact on all 3 of our clusters.
We have traced this to enabling full Ceph monitoring by Zabbix last
week. The leak is about 20 GB per day and seems to be proportional to
the number of PGs. For some time we had just the default settings and no
memory leak, but had not got around to finding out why many of the
Zabbix items were showing "Access Denied". We traced that to the MGR's
MON caps, which were "mon 'profile mgr'".
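For context, this is roughly how the caps appear on the daemon's key
(the entity name will differ per deployment, the osd/mds lines are just
the usual defaults, and the output is trimmed):

  ceph auth get mgr.host1
  [mgr.host1]
          key = <redacted>
          caps mds = "allow *"
          caps mon = "profile mgr"
          caps osd = "allow *"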
The MON logs showed this entry recurring:
log_channel(audit) log [DBG] : from='mgr.284576436 192.168.xxx.xxx:0/2356365' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: access denied
Changing the MGR caps to "mon 'allow *'" and restarting the MGR
immediately allowed "pg dump" to work, and all the follow-on REST calls
then succeeded:
log_channel(audit) log [DBG] : from='mgr.283590200 192.168.xxx.xxx:0/1779' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: dispatch
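For reference, the change was along these lines (the osd/mds caps were
left as they were, and the exact restart method will vary by
deployment):

  ceph auth caps mgr.host1 mon 'allow *' osd 'allow *' mds 'allow *'
  systemctl restart ceph-mgr@host1   # on the MGR host, so the daemon re-authenticates with the new caps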
However, it has also caused the memory leak to start.
We have reverted the caps and are back to where we were.
Two questions:
1) No matter what the REST consumer is doing, the MGR should not
accumulate memory, especially as we can see that the REST TCP
connections have been closed. Is there anything more we can do to
diagnose this? (One idea is sketched below.)
2) Setting "allow *" worked, but is there a better setting that just
allows the "pg dump" call (in addition to profile mgr)? (A guess at what
that might look like is also sketched below.)
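For (1), the only further idea we have so far is tcmalloc heap
profiling on the active MGR, assuming ceph-mgr honours the same "heap"
tell commands that are documented for the MON/OSD/MDS daemons, roughly:

  ceph tell mgr.host1 heap start_profiler   # start collecting heap profiles
  ceph tell mgr.host1 heap stats            # in-use vs freed-but-unreleased memory
  ceph tell mgr.host1 heap dump             # write a profile for offline inspection
  ceph tell mgr.host1 heap stop_profiler    # stop profiling when done

If there is a better way to see where the memory is going, we would be
glad to hear it.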
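For (2), what we had in mind is something along these lines, if the MON
cap grammar's "allow command" clause can be combined with a profile
(untested on our side; the entity name and the osd/mds caps are just
our current values):

  ceph auth caps mgr.host1 \
      mon 'profile mgr, allow command "pg dump"' \
      osd 'allow *' mds 'allow *'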
Thanks, Chris