Hi Frank,

fs fail isn't ideal -- there's an 'fs down' command for this. Here's a procedure we used (last used in the Nautilus days); the same steps are condensed into a shell sketch at the bottom of this mail:

1. If possible, unmount the fs from all clients, so that all dirty pages are flushed.
2. Prepare the ceph cluster: ceph osd set noout/noin
3. Wait until there is zero IO on the cluster, unmount any leftover clients.
4. ceph fs set cephfs down true
5. Stop all the ceph-osd daemons.
6. Power off the cluster. (At this point we had only the ceph-mons and ceph-mgrs running -- you can shut those down too.)
7. Power on the cluster, wait for mons/mgrs/osds/mds to come back up.
8. ceph fs set cephfs down false
9. Reconnect and test clients.
10. ceph osd unset noout/noin

--
Dan

On Wed, Oct 19, 2022 at 12:43 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> we need to prepare for temporary shut-downs of a part of our ceph cluster. I have 2 questions:
>
> 1) What is the recommended procedure to temporarily shut down a ceph fs quickly?
> 2) How to avoid MON store log spam overflow (on octopus 15.2.17)?
>
> To 1: Currently, I'm thinking about:
>
> - fs fail <fs-name>
> - shut down all MDS daemons
> - shut down all OSDs in that sub-cluster
> - shut down MGRs and MONs in that sub-cluster
> - power servers down
> - mark out OSDs manually (the number will exceed the MON limit for auto-out)
>
> - power up
> - wait a bit
> - do I need to mark OSDs in again, or will they join automatically after a manual out and restart (maybe just temporarily increase the MON limit at the end of the procedure above)?
> - fs set <fs_name> joinable true
>
> Is this a safe procedure? The documentation calls this a procedure for "Taking the cluster down rapidly for deletion or disaster recovery", and neither of those is our intent. We need a fast *reversible* procedure, because an "fs set down true" simply takes too long.
>
> There will be ceph fs clients remaining up. The desired behaviour is that client IO stalls until the fs comes back up and then just continues as if nothing had happened.
>
> To 2: We will have a sub-cluster down for an extended period of time. There have been cases where such a situation killed MONs due to an excessive amount of non-essential logs accumulating in the MON store. Is this still a problem with 15.2.17, and what can I do to reduce it?
>
> Thanks for any hints/corrections/confirmations!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
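
P.S. For reference, here is the procedure above condensed into shell form. This is a minimal sketch, not a tested script: it assumes the filesystem is named "cephfs" and that daemons are managed as plain systemd targets (ceph-osd.target etc., i.e. package-based deployments; adjust for cephadm/containers).

# --- Shutdown ---
# Steps 1-3: unmount clients first, then freeze out/in transitions
ceph osd set noout
ceph osd set noin

# Step 4: take the filesystem down cleanly
ceph fs set cephfs down true

# Step 5: stop the OSDs (run on each OSD host)
systemctl stop ceph-osd.target

# Step 6: optionally stop mgrs/mons last, then power off
systemctl stop ceph-mgr.target
systemctl stop ceph-mon.target

# --- Startup ---
# Step 7: power on and wait until mons/mgrs/osds/mds report in
ceph -s

# Step 8: bring the filesystem back
ceph fs set cephfs down false

# Step 9: remount and test clients, then
# Step 10: clear the flags
ceph osd unset noout
ceph osd unset noin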
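
P.P.S. On your second question (MON store growth during a long partial outage), which the procedure above doesn't address: the usual mitigations are keeping the cluster quiet and compacting the mon stores. A minimal sketch, assuming these option/command names still apply on Octopus and that your mon data lives under the default /var/lib/ceph/mon path ("mon.a" below is just a placeholder id):

# Compact a monitor's store on demand (repeat per monitor)
ceph tell mon.a compact

# Or have monitors compact their store automatically at startup
ceph config set mon mon_compact_on_start true

# Keep an eye on store size while the sub-cluster is down
du -sh /var/lib/ceph/mon/*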