Hi Stefan and anyone else reading this,

we are probably misunderstanding each other here:

> There is a strict MDS maintenance dance you have to perform [1].
> ...
> [1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/

Our ceph fs shut-down happened *after* completing the upgrade to octopus, it was *not part of it*. We are not in the middle of the upgrade procedure [1], we are done with it (the variant with bluestore_fsck_quick_fix_on_mount = false). As I explained at the beginning of this thread, all our daemons are on octopus:

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1046
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 12
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1068
    }
}

and the upgrade has been finalised with "ceph osd require-osd-release octopus" and by enabling v2 for the monitors.

The conversion I'm talking about happens *after* the complete upgrade, at which point I would expect the system to behave normally. This includes FS maintenance, shut-down and start-up. Ceph fs clients should not crash on "ceph fs set XYZ down true"; they should freeze. Etc. It's just the omap conversion that was postponed to post-upgrade as explained in [1], nothing else.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 06 October 2022 18:12:09
To: Frank Schilder; Igor Fedotov; ceph-users@xxxxxxx
Subject: Re: OSD crashes during upgrade mimic->octopus

On 10/6/22 17:22, Frank Schilder wrote:
>
> Just for the general audience. In the past we did cluster maintenance by setting "ceph fs set FS down true" (freezing all client IO in D-state), waiting for all MDSes to become standby and doing the job. After that, we set "ceph fs set FS down false", the MDSes started again, and all clients reconnected more or less instantaneously and continued exactly at the point where they were frozen.
>
> This time, a huge number of clients just crashed instead of freezing, and of the few that remained up only a small number reconnected. This is in our experience very unusual behaviour. Was there a change or are we looking at a potential bug here?

There is a strict MDS maintenance dance you have to perform [1], in order to avoid MDSes committing suicide, for example. We would just have the last remaining "up:active" MDS restart, and as soon as it became up:active again all clients would reconnect virtually instantly, even if rejoining had taken 3.5 minutes.

Especially for MDSes I would not deviate from best practices.

Gr. Stefan

[1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
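
For reference, the shut-down/start-up sequence discussed above looks roughly like the sketch below. This is only illustrative, not the documented upgrade dance from [1]: "FS" is a placeholder for the file system name, and the status checks are just one way of watching the MDS states.

  # freeze the file system: client IO blocks and the MDSes go to standby
  ceph fs set FS down true
  ceph fs status FS        # wait until no rank is up:active any more

  # ... perform the maintenance work ...

  # bring the file system back up; standby MDSes take the ranks again
  ceph fs set FS down false
  ceph fs status FS        # clients should reconnect once a rank is up:active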