Hi Ken,

This seems to have fixed that issue. It exposed another, https://tracker.ceph.com/issues/39264, which is causing ceph-mgr to become entirely unresponsive across the cluster, but cheroot itself seems to be OK.

David

On Wed, Dec 9, 2020 at 12:25 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> Ken,
>
> We have rebuilt the 15.2.7 container images with this RPM applied and will be deploying them to a larger (504 OSD) cluster to test - this cluster had the issue previously until we disabled polling via Prometheus. We will update as soon as it has run for a day or two and we've been able to verify that the mgr issues we saw no longer occur after extended polling via external and internal Prometheus instances.
>
> Thank you again for the quick update; we'll let you know as soon as we have more feedback.
> David
>
> On Tue, Dec 8, 2020 at 10:37 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
>> Hi Ken,
>>
>> Thank you for the update! As per:
>> https://github.com/ceph/ceph-container/issues/1748
>>
>> We implemented the suggested change (dropping the ulimit to 1024:4096 for the mgr) last night, and on our test cluster of 504 OSDs, polled by both the internal Prometheus and our external instance, the mgrs stopped responding and dropped out of the cluster entirely. This impacts not just metrics but the mgr itself. I think this is a high-priority issue: metrics are critical for prod, but the mgr itself seems to be impacted on a moderately sized cluster.
>>
>> Respectfully,
>> David Orman
>>
>> On Mon, Dec 7, 2020 at 1:50 PM Ken Dreyer <kdreyer@xxxxxxxxxx> wrote:
>>
>>> Thanks for bringing this up.
>>>
>>> We need to update Cheroot in Fedora and EPEL 8. I've opened https://src.fedoraproject.org/rpms/python-cheroot/pull-request/3 to get this into Fedora first.
>>>
>>> I've published an el8 RPM at https://fedorapeople.org/~ktdreyer/bz1868629/ for early testing. I can bring up a "hello world" CherryPy app with this, but I've not tested it with Ceph.
>>>
>>> - Ken
>>>
>>> On Mon, Dec 7, 2020 at 9:57 AM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We have a Ceph 15.2.7 deployment using cephadm under podman w/ systemd. We've run into what we believe is:
>>> >
>>> > https://github.com/ceph/ceph-container/issues/1748
>>> > https://tracker.ceph.com/issues/47875
>>> >
>>> > In our case, the mgr container eventually stops emitting output/logging. We are polling with external Prometheus clusters, which is likely what triggers the issue, as it appears some amount of time after the container is spawned.
>>> >
>>> > Unfortunately, setting limits in the systemd service file for the mgr service on the host OS doesn't work, nor does modifying the unit.run file (which is used to start the container under podman) to include the --ulimit settings as suggested. Looking inside the container:
>>> >
>>> > lib/systemd/system/ceph-mgr@.service:LimitNOFILE=1048576
>>> >
>>> > This prevents us from deploying medium-to-large Ceph clusters, so I would argue it's a high-priority bug that should not be closed unless there is a workaround that works until EPEL 8 contains the fixed version of cheroot and the Ceph containers include it.
>>> >
>>> > My understanding is this was fixed in cheroot 8.4.0:
>>> >
>>> > https://github.com/cherrypy/cheroot/issues/249
>>> > https://github.com/cherrypy/cheroot/pull/301
>>> >
>>> > Thank you in advance for any suggestions,
>>> > David

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
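A minimal sketch of the --ulimit workaround discussed earlier in the thread, for anyone wanting to try it while waiting on the cheroot update. The <fsid>/<hostname> placeholders and exact paths are assumptions that vary by deployment, and note that per the reports above this did not resolve the mgr hangs on larger clusters:

    # Attempt 1: append a lower limit to the podman command cephadm generated
    # for the mgr daemon (values taken from ceph-container issue #1748).
    # Edit the "podman run" line in:
    #   /var/lib/ceph/<fsid>/mgr.<hostname>/unit.run
    # and append:
    #   --ulimit nofile=1024:4096

    # Attempt 2: cap NOFILE on the host via a systemd drop-in for the
    # cephadm-managed mgr unit (unit name is a placeholder):
    mkdir -p /etc/systemd/system/ceph-<fsid>@mgr.<hostname>.service.d
    printf '[Service]\nLimitNOFILE=4096\n' > /etc/systemd/system/ceph-<fsid>@mgr.<hostname>.service.d/nofile.conf
    systemctl daemon-reload
    systemctl restart ceph-<fsid>@mgr.<hostname>.service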
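Separately, once rebuilt images are in play, a quick way to check which cheroot a running mgr container actually ships (per the thread, the fix should be present in 8.4.0 or newer). The container name, package name, and version attribute are assumptions based on the CentOS-based upstream images; adjust for your deployment:

    # Find the mgr container, then query the bundled cheroot
    podman ps --format '{{.Names}}' | grep mgr
    podman exec <mgr-container-name> rpm -q python3-cheroot
    podman exec <mgr-container-name> python3 -c 'import cheroot; print(cheroot.__version__)'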