We do not use containers. Anything special for debugging, or should we try
something from the previous email?

   - Enable profiling (Mark Nelson)
   - Try Bloomberg's Python mem profiler
   <https://github.com/bloomberg/memray> (Matthew Leonard)

Does "profiling" mean the instructions from
https://docs.ceph.com/en/pacific/rados/troubleshooting/memory-profiling/ ?
(Rough sketches of both approaches are appended below, after the quoted
thread.)

Rok

On Thu, Sep 7, 2023 at 9:34 PM Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:

> Hi Rok,
>
> We're still trying to catch what's causing the memory growth, so it's
> hard to guess which releases are affected. We know it's happening
> intermittently on a live Pacific cluster at least. If you have the
> ability to catch it while it's happening, there are several
> approaches/tools that might aid in diagnosing it. Container deployments
> are a bit tougher to get debugging tools working in, though, which AFAIK
> has slowed down existing attempts at diagnosing the issue.
>
> Mark
>
> On 9/7/23 05:55, Rok Jaklič wrote:
> > Hi,
> >
> > we have also experienced several ceph-mgr OOM kills on ceph v16.2.13
> > with 120T/200T of data.
> >
> > Is there a tracker for this problem?
> >
> > Does upgrading to 17.x "solve" the problem?
> >
> > Kind regards,
> > Rok
> >
> > On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta <epuertat@xxxxxxxxxx>
> > wrote:
> >
> >> Dear Cephers,
> >>
> >> Today brought us an eventful CLT meeting: it looks like Jitsi recently
> >> started requiring user authentication
> >> <https://jitsi.org/blog/authentication-on-meet-jit-si/> (anonymous
> >> users will get a "Waiting for a moderator" modal), but authentication
> >> didn't work against Google or GitHub accounts, so we had to move to
> >> the good old Google Meet.
> >>
> >> As a result of this, Neha has kindly set up a new private Slack
> >> channel (#clt) to allow for quicker communication among CLT members
> >> (if you usually attend the CLT meeting and have not been added,
> >> please ping any CLT member to request that).
> >>
> >> Now, let's move on to the important stuff:
> >>
> >> *The latest Pacific Release (v16.2.14)*
> >>
> >> *The Bad*
> >> The 14th drop of the Pacific release has landed with a few hiccups:
> >>
> >>    - Some .deb packages were made available on downloads.ceph.com
> >>    before the release process had completed. Although this is not the
> >>    first time this has happened, we want to make sure it is the last,
> >>    so we'd like to gather ideas to improve the release publishing
> >>    process. Neha encouraged everyone to share ideas here:
> >>       - https://tracker.ceph.com/issues/62671
> >>       - https://tracker.ceph.com/issues/62672
> >>    - v16.2.14 also hit issues during the ceph-container stage. Laura
> >>    wanted to raise awareness of its current setbacks
> >>    <https://pad.ceph.com/p/16.2.14-struggles> and collect ideas to
> >>    tackle them:
> >>       - Enforce reviews and mandatory CI checks
> >>       - Rework the current approach to use simple Dockerfiles
> >>       <https://github.com/ceph/ceph/pull/43292>
> >>       - Call on the Ceph community for help: ceph-container is
> >>       currently maintained part-time by a single contributor
> >>       (Guillaume Abrioux). This sub-project would benefit from the
> >>       sound expertise on containers among Ceph users. If you have
> >>       ever considered contributing to Ceph, but felt a bit intimidated
> >>       by C++, Paxos and race conditions, ceph-container is a good
> >>       place to shed your fear.
> >>
> >> *The Good*
> >> Not everything about v16.2.14 was going to be bleak: David Orman
> >> brought us really good news.
> >> They tested v16.2.14 on a large production cluster (10 Gbit/s+ RGW
> >> and ~13 PiB raw) and found that it solved a major issue affecting RGW
> >> in Pacific <https://github.com/ceph/ceph/pull/52552>.
> >>
> >> *The Ugly*
> >> During that testing, they noticed that ceph-mgr was occasionally OOM
> >> killed (nothing new in 16.2.14; it had been reported previously). They
> >> have already tried:
> >>
> >>    - Disabling modules (like the restful one, which was a suspect)
> >>    - Enabling debug 20
> >>    - Turning the pg autoscaler off
> >>
> >> Debugging will continue to characterize this issue:
> >>
> >>    - Enable profiling (Mark Nelson)
> >>    - Try Bloomberg's Python mem profiler
> >>    <https://github.com/bloomberg/memray> (Matthew Leonard)
> >>
> >> *Infrastructure*
> >>
> >> *Reminder: Infrastructure Meeting Tomorrow, 11:30-12:30 Central Time*
> >>
> >> Patrick brought up the following topics:
> >>
> >>    - We need to reduce OVH spending ($72k/year, which is a sizeable
> >>    chunk of the Ceph Foundation budget; that's a lot fewer avocado
> >>    sandwiches for the next Cephalocon):
> >>       - Move services (e.g. Chacra) to the Sepia lab
> >>       - Re-use CentOS (and any spare/unused) machines for devel
> >>       purposes
> >>    - The current Ceph sysadmins are overloaded, so devel/community
> >>    involvement would be much appreciated.
> >>    - More to be discussed in tomorrow's meeting. Please join if you
> >>    think you can help solve/improve the Ceph infrastructure!
> >>
> >> *BTW*: today's CDM will be canceled, since no topics were proposed.
> >>
> >> Kind Regards,
> >>
> >> Ernesto
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
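
P.S. Re: the "enable profiling" item -- a minimal sketch of the tcmalloc
heap-profiling workflow from the troubleshooting page above, driven from
Python. It assumes the daemons were built against tcmalloc and that ceph-mgr
answers the same "ceph tell <daemon> heap <cmd>" commands the docs show for
OSDs/MONs (worth verifying on 16.2.x); the daemon id "mgr.a", the sampling
interval and the pprof invocation are placeholders, not anything confirmed
in this thread.

    #!/usr/bin/env python3
    # Drive the tcmalloc heap profiler on a ceph-mgr daemon via the
    # "ceph tell <daemon> heap <cmd>" admin commands described in
    # https://docs.ceph.com/en/pacific/rados/troubleshooting/memory-profiling/
    # Assumption: the mgr exposes the same "heap" commands as OSDs/MONs.
    import subprocess
    import time

    MGR = "mgr.a"  # placeholder: replace with the active mgr id

    def heap(*args):
        cmd = ["ceph", "tell", MGR, "heap", *args]
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    heap("start_profiler")        # start writing heap profiles
    for _ in range(6):            # sample stats while memory grows
        heap("stats")
        time.sleep(600)           # every 10 minutes; adjust as needed
    heap("dump")                  # snapshot a profile into the mgr log dir
    heap("stop_profiler")

    # The dumped profiles can then be inspected offline, e.g. with
    #   google-pprof --text /usr/bin/ceph-mgr /var/log/ceph/mgr.a.profile.0001.heap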
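
And re: the memray suggestion -- a minimal, self-contained sketch of
memray's Tracker API on a toy workload; the idea would be to wrap a suspect
mgr Python module's main loop the same way to trace Python-side allocations.
It assumes memray can be installed into the interpreter that ceph-mgr
embeds, which is untested here; the output filename and the toy workload are
made up for illustration.

    #!/usr/bin/env python3
    # Sketch of Bloomberg's memray (https://github.com/bloomberg/memray):
    # record the Python-level allocations made while a block of code runs.
    # Requires "pip install memray"; the output path is a placeholder.
    import memray

    def workload():
        # Stand-in for the code whose allocations we want to trace,
        # e.g. a mgr module's serve() loop.
        blobs = [bytearray(1024 * 1024) for _ in range(64)]
        return sum(len(b) for b in blobs)

    if __name__ == "__main__":
        with memray.Tracker("mgr-suspect.memray.bin"):
            workload()

    # Summarize the capture afterwards, e.g.:
    #   memray stats mgr-suspect.memray.bin
    #   memray flamegraph mgr-suspect.memray.bin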