On 10/6/22 17:22, Frank Schilder wrote:
> Just for the general audience: in the past we did cluster maintenance by setting "ceph fs set FS down true" (freezing all client IO in D-state), waiting for all MDSes to become standby, and then doing the job. After that, we set "ceph fs set FS down false", the MDSes started again, and all clients reconnected more or less instantaneously and continued exactly at the point where they were frozen. This time, a huge number of clients just crashed instead of freezing, and of the few that remained up, only a small number reconnected. In our experience this is very unusual behaviour. Was there a change, or are we looking at a potential bug here?
There is a strict MDS maintenance dance you have to perform [1], for example to avoid an MDS committing suicide. We would just restart the last remaining "up:active" MDS, and as soon as it became up:active again, all clients would reconnect virtually instantly, even if rejoining had taken 3.5 minutes. Especially for MDSes, I would not deviate from best practices.
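Roughly, the dance looks like the sketch below. This is only an outline of the sequence as I understand it from [1], not a tested script: the file system name "cephfs" and the previous max_mds value of 2 are placeholders, the run/mds_dance helpers are made up for illustration, and the systemctl steps have to be run on the respective MDS hosts rather than from one place. It defaults to a dry run that only prints the commands. Check [1] for the authoritative steps for your release.

```shell
#!/bin/sh
# Sketch of the MDS restart sequence, assuming the procedure in [1].
# FS_NAME and OLD_MAX_MDS are placeholders; adjust them for your cluster.
FS_NAME="${FS_NAME:-cephfs}"
OLD_MAX_MDS="${OLD_MAX_MDS:-2}"
# DRY_RUN=1 (the default) only prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

mds_dance() {
    # 1. Reduce the number of active ranks to one.
    run ceph fs set "$FS_NAME" max_mds 1
    # 2. Check (and wait) until only rank 0 is still up:active.
    run ceph status
    # 3. On every standby MDS host: stop the standby daemons.
    run systemctl stop ceph-mds.target
    # 4. Confirm that exactly one MDS is left and that it holds rank 0.
    run ceph status
    # 5. On the host of the last active MDS: restart it.
    run systemctl restart ceph-mds.target
    # 6. On the standby hosts: start the standby daemons again.
    run systemctl start ceph-mds.target
    # 7. Restore the original number of active ranks.
    run ceph fs set "$FS_NAME" max_mds "$OLD_MAX_MDS"
}

mds_dance
```

With the defaults this prints the seven commands prefixed by "+ ". Setting DRY_RUN=0 would execute them, which only makes sense if you split the systemctl steps out to the right hosts first.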
Gr. Stefan

[1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/