Re: OSD crashes during upgrade mimic->octopus

Hi Stefan and anyone else reading this, we are probably misunderstanding each other here:

> There is a strict MDS maintenance dance you have to perform [1].
> ...
> [1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/

Our CephFS shutdown happened *after* completing the upgrade to octopus; it was *not part of it*. We are not in the middle of the upgrade procedure [1], we are done with it (with the bluestore_fsck_quick_fix_on_mount = false setting). As I explained at the beginning of this thread, all our daemons are on octopus:

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1046
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 12
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1068
    }
}

and the upgrade has been finalised with

ceph osd require-osd-release octopus

and enabling v2 for the monitors.
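For completeness, a sketch of the finalisation steps referred to above (the exact invocation may differ depending on how the monitors were deployed):

```shell
# Raise the minimum allowed OSD release to octopus
ceph osd require-osd-release octopus

# Switch the monitors over to the v2 messenger protocol
ceph mon enable-msgr2
```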

The conversion I'm talking about happens *after* the complete upgrade, at which point I would expect the system to behave normally. This includes FS maintenance, shutdown and startup. CephFS clients should not crash on "ceph fs set XYZ down true", they should freeze. Etc.
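To make the expectation concrete, this is the maintenance sequence I mean (XYZ stands for the file system name; the freeze/resume behaviour is what we relied on in the past):

```shell
# Freeze client IO: clients block in D-state, MDSes go to standby
ceph fs set XYZ down true

# ... perform maintenance on the cluster ...

# Bring the file system back: MDSes restart, clients should reconnect
# and resume exactly where they were frozen
ceph fs set XYZ down false
```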

It's just the omap conversion that was postponed until after the upgrade, as explained in [1], nothing else.
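For reference, a sketch of how that postponed conversion would eventually be triggered (assuming the config-based approach; the OSD id is a placeholder):

```shell
# Re-enable the on-mount quick-fix so OSDs convert their omaps at next start
ceph config set osd bluestore_fsck_quick_fix_on_mount true

# Then restart OSDs one at a time (or per failure domain); each OSD
# performs the omap format conversion during startup
systemctl restart ceph-osd@<id>
```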

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 06 October 2022 18:12:09
To: Frank Schilder; Igor Fedotov; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

On 10/6/22 17:22, Frank Schilder wrote:

>
> Just for the general audience. In the past we did cluster maintenance by setting "ceph fs set FS down true" (freezing all client IO in D-state), waited for all MDSes becoming standby and doing the job. After that, we set "ceph fs set FS down false", the MDSes started again, all clients connected more or less instantaneously and continued exactly at the point where they were frozen.
>
> This time, a huge number of clients just crashed instead of freezing and of the few ones that remained up only a small number reconnected. This is in our experience very unusual behaviour. Was there a change or are we looking at a potential bug here?

There is a strict MDS maintenance dance you have to perform [1]. In
order to avoid MDS committing suicide for example. We would just have
the last remaining "up:active" MDS restart. And as soon as it became
up:active again all clients would reconnect virtually instantly. Even if
rejoining had taken 3.5 minutes. Especially for MDSes I would not
deviate from best practices.

Gr. Stefan

[1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



