Re: quincy v17.2.0 QE Validation status

The LRC is upgraded, but the same mgr crashed during the upgrade.  It is running now despite the crash; Adam suspects the crash is due to earlier breakage.

https://pastebin.com/NWzzsNgk

Shall I start the build after https://github.com/ceph/ceph/pull/45932 gets merged?

On 4/18/22 14:02, Adam King wrote:
A patch to revert the pids-limit change has been opened: https://github.com/ceph/ceph/pull/45932. We created a build with that patch on top of the current quincy branch and are upgrading the LRC to it now. So far it seems to be going well: all the mgr, mon, and crash daemons have been upgraded without issue, and it is currently upgrading the OSDs, so the patch appears to be working.

Additionally, there was some investigation this morning to get the cluster back into a good state. The mgr daemons were redeployed with a version that includes https://github.com/ceph/ceph/pull/45853. While we aren't going with that patch for now, importantly it stops us from deploying other mgr daemons with the pids-limit set. From that point, modifying each mgr daemon's unit.run file to remove the --pids-limit section, restarting the mgr daemons' systemd units, and then upgrading the cluster fully to that patch's image got things back into a stable position. This confirmed that the pids-limit was causing the issue in the cluster. From that point we didn't touch the cluster further until this new upgrade to the version with the reversion.
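For illustration, the manual unit.run edit described above can be sketched as a small helper. This is hypothetical code mirroring the edit, not cephadm's implementation; the exact spelling of the flag in a real unit.run file may differ.

```python
import re

def strip_pids_limit(unit_run_line: str) -> str:
    # Drop '--pids-limit=N' as well as '--pids-limit N' (N may be negative),
    # matching the manual removal of the flag from the container run command.
    return re.sub(r'\s--pids-limit(?:=|\s+)-?\d+', '', unit_run_line)

line = "/usr/bin/podman run --rm --pids-limit=-1 --net=host ceph-mgr"
print(strip_pids_limit(line))
```

After editing the file, the daemon's systemd unit would be restarted so the container is launched without the flag.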

To summarize, the reversion on top of the current quincy branch seems to be working okay, and we should be ready to make a new final build based on it.

Thanks,
  - Adam King

On Mon, Apr 18, 2022 at 9:36 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

    On Fri, Apr 15, 2022 at 3:10 AM David Galloway
    <dgallowa@xxxxxxxxxx> wrote:
    >
    > For transparency and posterity's sake...
    >
    > I tried upgrading the LRC and the first two mgrs upgraded fine but
    > reesi004 threw an error.
    >
    > Apr 14 22:54:36 reesi004 podman[2042265]: 2022-04-14 22:54:36.210874346
    > +0000 UTC m=+0.138897862 container create
    > 3991bea0a86f55679f9892b3fbceeef558dd1edad94eb4bf73deebf6595bcc99
    > (image=quay.ceph.io/ceph-ci/ceph@sha256:230120c6a429af7546b91180a3da39846e760787580d7b5193487
    > Apr 14 22:54:36 reesi004 bash[2042070]: Error: OCI runtime error:
    > writing file `pids.max`: Invalid argument
    >
    > Adam and I suspected we needed
    > https://github.com/ceph/ceph/pull/45853#issue-1200032778 so I took the
    > tip of quincy, cherry-picked that PR and pushed to dgalloway-quincy-fix
    > in ceph-ci.git.  Then I waited for packages and a container to get
    > built and attempted to upgrade the LRC to that container version.
    >
    > Same error though.  So I'm leaving it for the weekend. We have two
    > MGRs that *did* upgrade to the tip of quincy but the rest of the
    > containers are still running 17.1.0-5-g8299cd4c.

    I don't think https://github.com/ceph/ceph/pull/45853 would help.
    The problem appears to be that --pids-limit=-1 just doesn't work on
    older podman versions.  "-1" is not massaged there, so podman tries
    to write it verbatim to /sys/fs/cgroup/pids/.../pids.max, which fails
    because the pids.max file expects either a non-negative integer or
    "max" [1].  I don't understand how some of the other manager daemons
    upgraded though, since the LRC nodes appear to be running Ubuntu
    18.04 LTS with an older podman:

        $ podman --version
        podman version 3.0.1
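The kernel's acceptance rule for cgroup-v1 pids.max [1] can be modeled with a small check. This is an illustrative sketch of the rule, not kernel code:

```python
def kernel_accepts_pids_max(value: str) -> bool:
    # cgroup v1 pids.max takes "max" or a non-negative integer;
    # anything else, including "-1", is rejected with EINVAL,
    # which is what podman surfaces as "Invalid argument".
    return value == "max" or value.isdigit()

for v in ("max", "0", "2048", "-1"):
    print(v, kernel_accepts_pids_max(v))
```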

    This was reported in [2] and addressed in podman in [3], fairly
    recently.  Their fix was to make "-1" be treated the same as "0", as
    older podman versions insisted on "0" for unlimited and "-1" either
    never worked or stopped working a long time ago.  docker seems to
    accept both "-1" and "0" for unlimited.
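The podman fix [3] amounts to normalizing the value before it reaches the cgroup file. Roughly, as a Python sketch of the idea (podman itself is Go):

```python
def normalize_pids_limit(limit: int) -> int:
    # podman's fix treats -1 the same as 0 (unlimited), so the
    # cgroup pids.max file never sees a negative value.
    return 0 if limit == -1 else limit

print(normalize_pids_limit(-1))
```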

    The best course of action would probably be to drop [4] from quincy,
    getting it back to the 17.1.0 state (i.e. no --pids-limit option in
    sight), and amend the original --pids-limit change in master so that
    it works for all versions of podman.  The podman version is already
    checked in a couple of places (e.g. CGROUPS_SPLIT_PODMAN_VERSION), so
    it should be easy enough, or we could just unconditionally pass "0"
    even though it is no longer documented.
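Following the existing version-check pattern (e.g. CGROUPS_SPLIT_PODMAN_VERSION), the amended change could pick the flag value per podman version. A sketch only: the threshold version and the constant name are assumptions, not verified values.

```python
# Assumed threshold for when "-1" became safe to pass; purely
# illustrative, not a verified podman release number.
PIDS_LIMIT_NEG_ONE_PODMAN_VERSION = (4, 0, 0)

def pids_limit_arg(podman_version: tuple) -> str:
    # Older podman insists on "0" for unlimited; newer versions
    # accept "-1" as well.
    if podman_version >= PIDS_LIMIT_NEG_ONE_PODMAN_VERSION:
        return "--pids-limit=-1"
    return "--pids-limit=0"

print(pids_limit_arg((3, 0, 1)))
```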

    (The reason for backporting [4] to quincy was to fix containerized
    iSCSI deployments, where bumping into the default PID limit is just
    a matter of scaling the number of exported LUNs.  It's been that way
    since the initial pacific release though, so taking it out for now
    is completely acceptable.)

    [1] https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt
    [2] https://github.com/containers/podman/issues/11782
    [3] https://github.com/containers/podman/pull/11794
    [4] https://github.com/ceph/ceph/pull/45576

    Thanks,

                Ilya

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



