Hi all,

we need to prepare for temporary shutdowns of a part of our Ceph cluster. I have 2 questions:

1) What is the recommended procedure for temporarily shutting down a Ceph FS quickly?
2) How do I avoid the MON store overflowing with log spam (on Octopus 15.2.17)?

To 1: Currently, I'm thinking about:

- fs fail <fs-name>
- shut down all MDS daemons
- shut down all OSDs in that sub-cluster
- shut down MGRs and MONs in that sub-cluster
- power servers down
- mark OSDs out manually (their number will exceed the MON limit for auto-out)
- power up
- wait a bit
- do I need to mark the OSDs in again, or will they join automatically after a manual out and restart (maybe just temporarily increase the MON limit at the end of the procedure above)?
- fs set <fs_name> joinable true

Is this a safe procedure? The documentation calls this a procedure for "taking the cluster down rapidly for deletion or disaster recovery", and neither of the two is our intent. We need a fast *reversible* procedure, because an "fs set down true" simply takes too long. There will be Ceph FS clients remaining up. The desired behaviour is that client I/O stalls until the fs comes back up and then just continues as if nothing had happened.

To 2: We will have a sub-cluster down for an extended period of time. There have been cases where such a situation killed MONs due to an excessive amount of non-essential log entries accumulating in the MON store. Is this still a problem with 15.2.17, and what can I do to reduce it?

Thanks for any hints/corrections/confirmations!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
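On the MON store question: two knobs that exist in Octopus and can help keep store growth visible and in check are sketched below. This is a sketch, not a recommendation; the values are illustrative only, and it addresses store size rather than the log traffic itself.

```ini
# ceph.conf sketch (Octopus option names; values illustrative only)
[mon]
# Compact the mon store (RocksDB) at daemon start, reclaiming space
# left behind by accumulated log and map entries.
mon compact on start = true

# Threshold in bytes for the MON_DISK_BIG health warning
# (the Octopus default is 15 GiB).
mon data size warn = 16106127360
```

A monitor's store can also be compacted at runtime with `ceph tell mon.<id> compact`. Note that compaction only reclaims space; it does not reduce what the daemons send to the cluster log in the first place.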
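For reference, the shutdown/restart sequence proposed above can be sketched as a dry-run script. All concrete names here are placeholders I made up (file system `cephfs`, OSD ids 0-2), and treating `mon_osd_min_in_ratio` as the "MON limit for auto-out" is my assumption; `CEPH="echo ceph"` prints each command instead of executing it, so the sequence can be reviewed safely before running it for real with `CEPH=ceph`.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the proposed shutdown/restart sequence.
# Placeholders: fs name "cephfs", OSD ids 0-2. Set CEPH=ceph to execute.
CEPH="echo ceph"

# 1) Fail the file system so remaining clients block instead of erroring out.
$CEPH fs fail cephfs

# 2)-5) Stop the daemons on each affected host (run there via systemd), e.g.:
#   systemctl stop ceph-mds.target ceph-osd.target
#   systemctl stop ceph-mgr.target ceph-mon.target
# ...then power the servers down.

# 6) Mark the affected OSDs out manually; automatic marking-out of this
#    many OSDs would be capped (assumption: by mon_osd_min_in_ratio).
for id in 0 1 2; do
    $CEPH osd out "$id"
done

# 7) After power-up: if the restarted OSDs do not rejoin on their own
#    after the manual out, mark them in explicitly.
for id in 0 1 2; do
    $CEPH osd in "$id"
done

# 8) Make the file system joinable again so the MDS daemons take over.
$CEPH fs set cephfs joinable true
```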