Best practices regarding MDS node restart

I am looking for best-practice guidance for the following situation.

There is a Ceph cluster with CephFS deployed. There are three servers
dedicated to running MDS daemons: one active, one standby-replay, and one
standby. There is only a single rank.

Sometimes, servers need to be rebooted for reasons unrelated to Ceph.
What is the proper procedure for restarting a node that currently hosts
the active MDS daemon? The goal is to minimize client downtime. Ideally,
clients should not notice anything, even while playing MP3s from the
CephFS filesystem (note that I haven't tested this exact scenario) - is this

I tried to use the "ceph mds fail mds02" command while mds02 was active and
mds03 was standby-replay, to force the fail-over to mds03 so that I could
reboot mds02. Result: mds02 became standby, while mds03 went through
reconnect (30 seconds), rejoin (another 30 seconds), and replay (5 seconds)
phases. During the "reconnect" and "rejoin" phases, the "Activity" column
of "ceph fs status" is empty, which concerns me. It looks like I just
caused a 65-second downtime. After all of that, mds02 became
standby-replay, as expected.
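
For reference, here is a rough sketch of how the per-phase timings above
could be tallied from periodic observations of the rank's state. It is
only an illustration: the sample list is hypothetical and simply mirrors
the numbers quoted above, the function names are made up for this
example, and it assumes that every state other than "active" counts as
client-visible downtime, which may be too pessimistic.

```python
from typing import Dict, List, Tuple

# Each sample is (seconds since fail-over, observed MDS state).
# In practice these could be collected by polling "ceph fs status";
# the values below are hypothetical and match the timings in this post.
Sample = Tuple[float, str]


def phase_durations(samples: List[Sample]) -> Dict[str, float]:
    """Sum the time spent in each state, given samples sorted by time."""
    durations: Dict[str, float] = {}
    # Pair every sample with its successor; the interval between them
    # is attributed to the state seen at the start of the interval.
    for (t0, state), (t1, _) in zip(samples, samples[1:]):
        durations[state] = durations.get(state, 0.0) + (t1 - t0)
    return durations


def downtime(samples: List[Sample], serving_state: str = "active") -> float:
    """Total time spent in any state other than the serving state."""
    return sum(
        seconds
        for state, seconds in phase_durations(samples).items()
        if state != serving_state
    )


samples: List[Sample] = [
    (0.0, "reconnect"),
    (30.0, "rejoin"),
    (60.0, "replay"),
    (65.0, "active"),
    (70.0, "active"),
]

print(phase_durations(samples))
print(downtime(samples))  # 65.0 for the sample data above
```

With the sample data, the three non-active phases add up to the same 65
seconds mentioned above.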

Is there a better way? Or, should I have rebooted mds02 without much

Alexander E. Patrakov
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
