Re: OSD crashes during upgrade mimic->octopus

Stefan Kooman <stefan@xxxxxx> · Thu, 6 Oct 2022 18:12:09 +0200

On 10/6/22 17:22, Frank Schilder wrote:

Just for the general audience. In the past we did cluster maintenance by setting "ceph fs set FS down true" (freezing all client IO in D-state), waiting for all MDSes to become standby, and then doing the job. After that, we set "ceph fs set FS down false", the MDSes started again, all clients connected more or less instantaneously and continued exactly at the point where they were frozen.

This time, a huge number of clients just crashed instead of freezing, and of the few that remained up, only a small number reconnected. In our experience this is very unusual behaviour. Was there a change, or are we looking at a potential bug here?

There is a strict MDS maintenance dance you have to perform [1], for
example to avoid MDSes committing suicide. We would just have the last
remaining "up:active" MDS restart, and as soon as it became up:active
again all clients would reconnect virtually instantly, even if
rejoining had taken 3.5 minutes. Especially for MDSes I would not
deviate from best practices.
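
For reference, here is a rough sketch of that dance as I remember it from
[1], written as a dry-run plan that only builds the command list and
executes nothing. The function name, the "where to run" labels and the
example file system name are made up for illustration; the package
upgrade itself is left out because it depends on your distribution.
Please check the exact steps against [1] for your release before
relying on this.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (where to run it, command); labels are illustrative


def mds_upgrade_plan(fs_name: str, old_max_mds: int) -> List[Step]:
    """Build the ordered MDS upgrade steps as a dry run; nothing is executed."""
    if old_max_mds < 1:
        raise ValueError("old_max_mds must be >= 1")
    return [
        # 1. Reduce the file system to a single active rank.
        ("admin node", f"ceph fs set {fs_name} max_mds 1"),
        # 2. Wait until only rank 0 is active and the rest are standby.
        ("admin node", "ceph status"),
        # 3. Take all standby MDS daemons offline.
        ("standby MDS hosts", "systemctl stop ceph-mds.target"),
        # 4. Confirm a single MDS is online and holds rank 0.
        ("admin node", "ceph status"),
        # 5. Upgrade packages on the active MDS host, then restart it.
        ("active MDS host", "systemctl restart ceph-mds.target"),
        # 6. Upgrade packages on the standby hosts, then start them again.
        ("standby MDS hosts", "systemctl restart ceph-mds.target"),
        # 7. Restore the previous number of active ranks.
        ("admin node", f"ceph fs set {fs_name} max_mds {old_max_mds}"),
    ]


if __name__ == "__main__":
    # "cephfs" and 2 are placeholder values, not taken from this thread.
    for number, (where, command) in enumerate(mds_upgrade_plan("cephfs", 2), 1):
        print(f"{number}. [{where}] {command}")
```

The point of step 5 is the one above: only the last remaining up:active
MDS is restarted, and the clients reconnect once it is up:active again.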

Gr. Stefan

[1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx