Re: Larger number of OSDs, cheroot, cherrypy, limits + containers == broken

Hi Ken,

Thank you for the update! As per:
https://github.com/ceph/ceph-container/issues/1748

We implemented the suggested change (dropping the ulimit to 1024:4096 for
the mgr) last night, and on our test cluster of 504 OSDs, polled by both
the internal prometheus module and our external instance, the mgrs stopped
responding and dropped out of the cluster entirely. This impacts not just
metrics but the mgr itself. I think this is a high-priority issue: metrics
are critical for prod, but the mgr itself is affected on even a moderately
sized cluster.
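
For reference, the change we applied was along the lines below. This is a
sketch rather than exact commands: the unit.run path, service name, and
container name all depend on the cluster fsid and hostname (the <fsid> and
<host> placeholders are ours), and the nofile values are the ones suggested
in the ceph-container issue above.

```shell
# Hypothetical sketch of applying the suggested ulimit change by hand.
# <fsid> and <host> are placeholders for our environment.

# 1. Edit the mgr's unit.run so the `podman run` invocation includes:
#      --ulimit nofile=1024:4096
#    e.g. in /var/lib/ceph/<fsid>/mgr.<host>/unit.run

# 2. Restart the mgr service so the container is recreated with the new limit:
systemctl restart ceph-<fsid>@mgr.<host>.service

# 3. Confirm the soft limit inside the running container:
podman exec "$(podman ps --filter name=mgr --format '{{.Names}}' | head -n1)" \
  sh -c 'ulimit -n'
```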

Respectfully,
David Orman

On Mon, Dec 7, 2020 at 1:50 PM Ken Dreyer <kdreyer@xxxxxxxxxx> wrote:

> Thanks for bringing this up.
>
> We need to update Cheroot in Fedora and EPEL 8. I've opened
> https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 to
> get this into Fedora first.
>
> I've published an el8 RPM at
> https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can
> bring up a "hello world" cherrypy app with this, but I've not tested
> it with Ceph.
>
> - Ken
>
> On Mon, Dec 7, 2020 at 9:57 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > We have a ceph 15.2.7 deployment using cephadm under podman w/ systemd.
> > We've run into what we believe is:
> >
> > https://github.com/ceph/ceph-container/issues/1748
> > https://tracker.ceph.com/issues/47875
> >
> > In our case, the mgr container eventually stops emitting output/logging.
> > We are polling with external prometheus clusters, which is likely what
> > triggers the issue, as it appears some time after the container is
> > spawned.
> >
> > Unfortunately, setting limits in the systemd service file for the mgr
> > service on the host OS doesn't work, nor does modifying the unit.run file
> > which is used to start the container under podman to include the --ulimit
> > settings as suggested. Looking inside the container:
> >
> > lib/systemd/system/ceph-mgr@.service:LimitNOFILE=1048576
> >
> > This prevents us from deploying medium to large ceph clusters, so I would
> > argue it's a high-priority bug that should not be closed, unless there is
> > a workaround that works until EPEL 8 contains the fixed version of cheroot
> > and the ceph containers include it.
> >
> > My understanding is this was fixed in cheroot 8.4.0:
> >
> > https://github.com/cherrypy/cheroot/issues/249
> > https://github.com/cherrypy/cheroot/pull/301
> >
> > Thank you in advance for any suggestions,
> > David
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
>
>
