Best practices regarding MDS node restart

"Alexander E. Patrakov" <patrakov@xxxxxxxxx> · Sat, 9 Sep 2023 21:38:51 +0800

Hello,

I am interested in the best-practice guidance for the following situation.

There is a Ceph cluster with CephFS deployed. There are three servers
dedicated to running MDS daemons: one active, one standby-replay, and one
standby. There is only a single rank.

Sometimes, servers need to be rebooted for reasons unrelated to Ceph.
What is the proper procedure for restarting a node that currently hosts
the active MDS daemon? The goal is to minimize client downtime. Ideally,
clients should not notice anything, even if they are playing MP3s from the
CephFS filesystem (note that I haven't tested this exact scenario). Is this
achievable?
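
One way to put a number on what clients would notice is to time a small
metadata operation in a loop on a mounted client while the fail-over
happens. The following is a minimal sketch, not a tested tool. The function
names and the mount path in the comment are hypothetical, and it assumes
that creating and removing a file has to go through the MDS (a plain stat
may be answered from the client cache).

```python
import os
import time


def touch_and_remove(directory):
    """Create and delete a small file; intended to force a metadata round trip."""
    path = os.path.join(directory, f".mds-probe-{os.getpid()}")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)
    os.unlink(path)


def longest_stall(probe, samples, interval=0.1,
                  clock=time.monotonic, pause=time.sleep):
    """Call probe() `samples` times and return the longest single call, in seconds."""
    worst = 0.0
    for _ in range(samples):
        start = clock()
        probe()  # blocks for as long as the filesystem does not answer
        worst = max(worst, clock() - start)
        pause(interval)
    return worst


# Hypothetical usage on a client, started shortly before the fail-over:
#   longest_stall(lambda: touch_and_remove("/mnt/cephfs/some-dir"), samples=1200)
```

The clock and pause arguments exist only so that the loop can be exercised
without a real filesystem. The returned value is the longest time a single
probe was blocked, which is a rough proxy for what an application would see.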

I tried the "ceph mds fail mds02" command while mds02 was active and
mds03 was standby-replay, to force a fail-over to mds03 so that I could
reboot mds02. Result: mds02 became standby, while mds03 went through the
reconnect (30 seconds), rejoin (another 30 seconds), and replay (5 seconds)
phases. During the "reconnect" and "rejoin" phases, the "Activity" column
of "ceph fs status" was empty, which concerns me. It looks like I just
caused about 65 seconds of downtime. After all of that, mds02 became
standby-replay, as expected.
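
The phase timings above can be collected by polling the cluster instead of
watching by eye. The following is a minimal sketch. The helper names are
hypothetical, and it assumes that "ceph fs status --format json" returns an
object with an "mdsmap" list whose entries carry "rank" and "state" fields;
check the output of your release before relying on it.

```python
import json
import subprocess
import time


def rank0_state(status):
    """Return the state of the daemon holding rank 0, or None if there is none.

    ASSUMPTION: `status` is the parsed JSON of `ceph fs status` and contains
    an "mdsmap" list of dicts with "rank" and "state" keys.
    """
    for daemon in status.get("mdsmap", []):
        # A standby-replay daemon follows rank 0 but does not serve it.
        if daemon.get("rank") == 0 and daemon.get("state") != "standby-replay":
            return daemon.get("state")
    return None


def sample_states(count, interval=1.0):
    """Poll the cluster `count` times; return a list of (timestamp, state)."""
    samples = []
    for _ in range(count):
        out = subprocess.run(["ceph", "fs", "status", "--format", "json"],
                             capture_output=True, text=True, check=True).stdout
        samples.append((time.monotonic(), rank0_state(json.loads(out))))
        time.sleep(interval)
    return samples


def phase_durations(samples):
    """Collapse time-ordered samples into (state, seconds) per completed phase."""
    if not samples:
        return []
    phases = []
    run_start, run_state = samples[0]
    for stamp, state in samples[1:]:
        if state != run_state:
            phases.append((run_state, stamp - run_start))
            run_start, run_state = stamp, state
    return phases  # the final, still-running phase is not included


def outage_seconds(phases):
    """Total time spent in any state other than "active"."""
    return sum(seconds for state, seconds in phases if state != "active")
```

Fed with the timings reported above (30 + 30 + 5 seconds outside the active
state), outage_seconds() gives 65. The resolution is limited by the polling
interval, so short phases can be missed or rounded.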

Is there a better way? Or should I simply have rebooted mds02 without any
preparation?

-- 
Alexander E. Patrakov