We do not use containers. Anything special for debugging, or should we try
something from the previous email?

   - Enable profiling (Mark Nelson)
   - Try Bloomberg's Python mem profiler
   <https://github.com/bloomberg/memray> (Matthew Leonard)

Does "profiling" mean the instructions from
https://docs.ceph.com/en/pacific/rados/troubleshooting/memory-profiling/ ?
(Rough sketches of both approaches are appended below, after the quoted
thread.)

Rok

On Thu, Sep 7, 2023 at 9:34 PM Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:

> Hi Rok,
>
> We're still trying to catch what's causing the memory growth, so it's
> hard to guess which releases are affected. We know it's happening
> intermittently on a live Pacific cluster at least. If you have the
> ability to catch it while it's happening, there are several
> approaches/tools that might aid in diagnosing it. Container deployments
> are a bit tougher to get debugging tools working in, though, which AFAIK
> has slowed down existing attempts at diagnosing the issue.
>
> Mark
>
> On 9/7/23 05:55, Rok Jaklič wrote:
> > Hi,
> >
> > we have also experienced several ceph-mgr OOM kills on ceph v16.2.13
> > with 120T/200T of data.
> >
> > Is there a tracker for this problem?
> >
> > Does upgrading to 17.x "solve" the problem?
> >
> > Kind regards,
> > Rok
> >
> > On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta <epuertat@xxxxxxxxxx>
> > wrote:
> >
> >> Dear Cephers,
> >>
> >> Today brought us an eventful CLT meeting: it looks like Jitsi recently
> >> started requiring user authentication
> >> <https://jitsi.org/blog/authentication-on-meet-jit-si/> (anonymous
> >> users will get a "Waiting for a moderator" modal), but authentication
> >> didn't work against Google or GitHub accounts, so we had to move to
> >> the good old Google Meet.
> >>
> >> As a result of this, Neha has kindly set up a new private Slack
> >> channel (#clt) to allow for quicker communication among CLT members
> >> (if you usually attend the CLT meeting and have not been added,
> >> please ping any CLT member to request that).
> >>
> >> Now, let's move on to the important stuff:
> >>
> >> *The latest Pacific Release (v16.2.14)*
> >>
> >> *The Bad*
> >> The 14th drop of the Pacific release has landed with a few hiccups:
> >>
> >>    - Some .deb packages were made available on downloads.ceph.com
> >>    before the release process had completed. Although this is not the
> >>    first time this has happened, we want to make sure it is the last,
> >>    so we'd like to gather ideas to improve the release publishing
> >>    process. Neha encouraged everyone to share ideas here:
> >>       - https://tracker.ceph.com/issues/62671
> >>       - https://tracker.ceph.com/issues/62672
> >>    - v16.2.14 also hit issues during the ceph-container stage. Laura
> >>    wanted to raise awareness of its current setbacks
> >>    <https://pad.ceph.com/p/16.2.14-struggles> and collect ideas to
> >>    tackle them:
> >>       - Enforce reviews and mandatory CI checks
> >>       - Rework the current approach to use simple Dockerfiles
> >>       <https://github.com/ceph/ceph/pull/43292>
> >>       - Call on the Ceph community for help: ceph-container is
> >>       currently maintained part-time by a single contributor
> >>       (Guillaume Abrioux). This sub-project would benefit from the
> >>       sound expertise on containers among Ceph users. If you have
> >>       ever considered contributing to Ceph, but felt a bit intimidated
> >>       by C++, Paxos and race conditions, ceph-container is a good
> >>       place to shed your fear.
> >>
> >> *The Good*
> >> Not everything about v16.2.14 was going to be bleak: David Orman
> >> brought us really good news.
> >> They tested v16.2.14 on a large production cluster (10 Gbit/s+ RGW
> >> and ~13 PiB raw) and found that it solved a major issue affecting RGW
> >> in Pacific <https://github.com/ceph/ceph/pull/52552>.
> >>
> >> *The Ugly*
> >> During that testing, they noticed that ceph-mgr was occasionally OOM
> >> killed (nothing new in 16.2.14; it had been reported previously). They
> >> have already tried:
> >>
> >>    - Disabling modules (like the restful one, which was a suspect)
> >>    - Enabling debug 20
> >>    - Turning the pg autoscaler off
> >>
> >> Debugging will continue to characterize this issue:
> >>
> >>    - Enable profiling (Mark Nelson)
> >>    - Try Bloomberg's Python mem profiler
> >>    <https://github.com/bloomberg/memray> (Matthew Leonard)
> >>
> >> *Infrastructure*
> >>
> >> *Reminder: Infrastructure Meeting Tomorrow, 11:30-12:30 Central Time*
> >>
> >> Patrick brought up the following topics:
> >>
> >>    - We need to reduce OVH spending ($72k/year, which is a sizeable
> >>    chunk of the Ceph Foundation budget; that's a lot fewer avocado
> >>    sandwiches for the next Cephalocon):
> >>       - Move services (e.g. Chacra) to the Sepia lab
> >>       - Re-use CentOS (and any spare/unused) machines for devel
> >>       purposes
> >>    - The current Ceph sysadmins are overloaded, so devel/community
> >>    involvement would be much appreciated.
> >>    - More to be discussed in tomorrow's meeting. Please join if you
> >>    think you can help solve/improve the Ceph infrastructure!
> >>
> >> *BTW*: today's CDM will be canceled, since no topics were proposed.
> >>
> >> Kind Regards,
> >>
> >> Ernesto
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
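
P.S. Re: the "enable profiling" item -- a minimal sketch of the tcmalloc
heap-profiling workflow from the troubleshooting page above, driven from
Python. It assumes the daemons were built against tcmalloc and that ceph-mgr
answers the same "ceph tell <daemon> heap <cmd>" commands the docs show for
OSDs/MONs (worth verifying on 16.2.x); the daemon id "mgr.a", the sampling
interval and the pprof invocation are placeholders, not anything confirmed
in this thread.

    #!/usr/bin/env python3
    # Drive the tcmalloc heap profiler on a ceph-mgr daemon via the
    # "ceph tell <daemon> heap <cmd>" admin commands described in
    # https://docs.ceph.com/en/pacific/rados/troubleshooting/memory-profiling/
    # Assumption: the mgr exposes the same "heap" commands as OSDs/MONs.
    import subprocess
    import time

    MGR = "mgr.a"  # placeholder: replace with the active mgr id

    def heap(*args):
        cmd = ["ceph", "tell", MGR, "heap", *args]
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    heap("start_profiler")        # start writing heap profiles
    for _ in range(6):            # sample stats while memory grows
        heap("stats")
        time.sleep(600)           # every 10 minutes; adjust as needed
    heap("dump")                  # snapshot a profile into the mgr log dir
    heap("stop_profiler")

    # The dumped profiles can then be inspected offline, e.g. with
    #   google-pprof --text /usr/bin/ceph-mgr /var/log/ceph/mgr.a.profile.0001.heap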
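
And re: the memray suggestion -- a minimal, self-contained sketch of
memray's Tracker API on a toy workload; the idea would be to wrap a suspect
mgr Python module's main loop the same way to trace Python-side allocations.
It assumes memray can be installed into the interpreter that ceph-mgr
embeds, which is untested here; the output filename and the toy workload are
made up for illustration.

    #!/usr/bin/env python3
    # Sketch of Bloomberg's memray (https://github.com/bloomberg/memray):
    # record the Python-level allocations made while a block of code runs.
    # Requires "pip install memray"; the output path is a placeholder.
    import memray

    def workload():
        # Stand-in for the code whose allocations we want to trace,
        # e.g. a mgr module's serve() loop.
        blobs = [bytearray(1024 * 1024) for _ in range(64)]
        return sum(len(b) for b in blobs)

    if __name__ == "__main__":
        with memray.Tracker("mgr-suspect.memray.bin"):
            workload()

    # Summarize the capture afterwards, e.g.:
    #   memray stats mgr-suspect.memray.bin
    #   memray flamegraph mgr-suspect.memray.bin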