Re: quincy v17.2.0 QE Validation status

A patch to revert the pids-limit change has been opened:
https://github.com/ceph/ceph/pull/45932. We created a build with that patch
applied on top of the current quincy branch and are upgrading the LRC to it
now. So far it is going well: all the mgr, mon and crash daemons have been
upgraded without issue and the OSDs are currently being upgraded, so the
patch seems to be working.

Additionally, there was some investigation this morning to get the cluster
back into a good state. The mgr daemons were redeployed with a version that
includes https://github.com/ceph/ceph/pull/45853. While we aren't going
with that patch for now, importantly it would cause us to not deploy other
mgr daemons with the pids-limit set. From that point, removing the
--pids-limit section from the mgr daemons' unit.run files, restarting the
mgr daemons' systemd units and then upgrading the cluster fully to this
patch's image got things back into a stable state. This confirmed that the
pids-limit was causing the issue in the cluster. After that we didn't
touch the cluster further until this new upgrade to the version with the
reversion.
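
For anyone who needs to repeat the unit.run edit by hand, here is a minimal
sketch of the text change only. It assumes the option appears either as
"--pids-limit=VALUE" or "--pids-limit VALUE"; the helper name and the sample
command line are made up for illustration and are not taken from cephadm.

```python
import re

# Matches the option in either form, plus the whitespace in front of it:
#   --pids-limit=-1
#   --pids-limit -1
_PIDS_LIMIT = re.compile(r"\s+--pids-limit(?:=|\s+)\S+")


def strip_pids_limit(line: str) -> str:
    """Return the command line with any --pids-limit option removed."""
    return _PIDS_LIMIT.sub("", line)


if __name__ == "__main__":
    # Hypothetical command line, only to show the effect.
    sample = "/usr/bin/podman run --rm --pids-limit=-1 --net=host IMAGE"
    print(strip_pids_limit(sample))
```

The affected daemon's systemd unit still has to be restarted afterwards for
the edited unit.run to take effect, as described above.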

To summarize, the reversion on top of the current quincy branch seems to be
working okay and we should be ready to make a new final build based on that.

Thanks,
  - Adam King

On Mon, Apr 18, 2022 at 9:36 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

> On Fri, Apr 15, 2022 at 3:10 AM David Galloway <dgallowa@xxxxxxxxxx>
> wrote:
> >
> > For transparency and posterity's sake...
> >
> > I tried upgrading the LRC and the first two mgrs upgraded fine but
> > reesi004 threw an error.
> >
> > Apr 14 22:54:36 reesi004 podman[2042265]: 2022-04-14 22:54:36.210874346
> > +0000 UTC m=+0.138897862 container create
> > 3991bea0a86f55679f9892b3fbceeef558dd1edad94eb4bf73deebf6595bcc99
> > (image=quay.ceph.io/ceph-ci/ceph@sha256:230120c6a429af7546b91180a3da39846e760787580d7b5193487
> > Apr 14 22:54:36 reesi004 bash[2042070]: Error: OCI runtime error:
> > writing file `pids.max`: Invalid argument
> >
> > Adam and I suspected we needed
> > https://github.com/ceph/ceph/pull/45853#issue-1200032778 so I took the
> > tip of quincy, cherry-picked that PR and pushed to dgalloway-quincy-fix
> > in ceph-ci.git.  Then I waited for packages and a container to get built
> > and attempted to upgrade the LRC to that container version.
> >
> > Same error though.  So I'm leaving it for the weekend.  We have two MGRs
> > that *did* upgrade to the tip of quincy but the rest of the containers
> > are still running 17.1.0-5-g8299cd4c.
>
> I don't think https://github.com/ceph/ceph/pull/45853 would help.
> The problem appears to be that --pids-limit=-1 just doesn't work on
> older podman versions.  "-1" is not massaged there and is attempted to
> be written to /sys/fs/cgroup/pids/.../pids.max, which fails because
> pids.max file expects either a non-negative integer or "max" [1].
> I don't understand how some of the other manager daemons upgraded
> though, since the LRC nodes appear to be running Ubuntu 18.04 LTS with
> an older podman:
>
>     $ podman --version
>     podman version 3.0.1
>
> This was reported in [2] and addressed in podman in [3], fairly
> recently.  Their fix was to make "-1" be treated the same as "0", as
> older podman versions insisted on "0" for unlimited and "-1" either
> never worked or stopped working a long time ago.  docker seems to
> accept both "-1" and "0" for unlimited.
>
> The best course of action would probably be to drop [4] from quincy,
> getting it back to 17.1.0 state (i.e. no --pids-limit option in sight)
> and amend the original --pids-limit change in master so that it works
> for all versions of podman.  The podman version is already checked in
> a couple of places (e.g. CGROUPS_SPLIT_PODMAN_VERSION) so it should be
> easy enough or we could just unconditionally pass "0" even though it
> is not documented anymore.
>
> (The reason for backporting [4] to quincy was to fix containerized
> iSCSI deployments where bumping into default PID limit is just a matter
> of scaling the number of exported LUNs.  It's been that way since the
> initial pacific release though so taking it out for now is completely
> acceptable.)
>
> [1] https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt
> [2] https://github.com/containers/podman/issues/11782
> [3] https://github.com/containers/podman/pull/11794
> [4] https://github.com/ceph/ceph/pull/45576
>
> Thanks,
>
>             Ilya
>
>
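
To make Ilya's two points above concrete, here is a rough sketch of (a) why
"-1" is rejected when written to pids.max and (b) what a version-dependent
choice of the "unlimited" value could look like. All function names are
hypothetical, and the thread does not say which podman release first
contains the fix from [3], so that threshold is left as a parameter. This is
not the cephadm implementation.

```python
import re


def is_valid_pids_max(value: str) -> bool:
    """pids.max expects "max" or a non-negative integer (see [1])."""
    value = value.strip()
    return value == "max" or re.fullmatch(r"[0-9]+", value) is not None


def parse_podman_version(output: str) -> tuple:
    """Turn "podman version 3.0.1" into (3, 0, 1)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no version found in: %r" % output)
    return tuple(int(part) for part in match.groups())


def unlimited_pids_value(version: tuple, minus_one_ok_from: tuple) -> str:
    """Pick the --pids-limit value meaning "unlimited" for this podman.

    minus_one_ok_from is an assumption supplied by the caller: the first
    podman version known to accept "-1". Older versions get "0", which
    the message above says they insisted on for unlimited.
    """
    return "-1" if version >= minus_one_ok_from else "0"


if __name__ == "__main__":
    version = parse_podman_version("podman version 3.0.1")
    # (4, 0, 0) is a placeholder threshold, not a verified release number.
    print(version, unlimited_pids_value(version, (4, 0, 0)))
    print(is_valid_pids_max("-1"), is_valid_pids_max("max"))
```

Unconditionally passing "0", the other option mentioned above, would avoid
the version check entirely at the cost of relying on undocumented behaviour.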