Re: ceph_leadership_team_meeting_s18e06.mkv

On 07/09/2023 21:33, Mark Nelson wrote:
Hi Rok,

We're still trying to catch what's causing the memory growth, so it's hard to guess which releases are affected.  We know it's happening intermittently on a live Pacific cluster, at least.  If you are able to catch it while it's happening, there are several approaches/tools that might aid in diagnosing it.  Debugging tools are a bit tougher to get working in container deployments, though, which afaik has slowed down existing attempts at diagnosing the issue.

We have a cluster recently upgraded from Octopus to Pacific 16.2.13 where the active MGR was OOM-killed a few times.

We have another cluster that was recently upgraded from 16.2.11 to 16.2.14, and the issue started to appear on that cluster too, very soon after the upgrade.
We didn't have the issue before, during the months running 16.2.11.

In short: the issue seems to be due to a change in 16.2.12 or 16.2.13.

Loïc Tortay
