Re: Temporary shutdown of subcluster and cephfs

Hi Dan,

I know that "fs fail ..." is not ideal, but we will not have time for a clean "fs set <fs_name> down true" and waiting for the journal flush to complete (on our cluster this takes at least 20 minutes, which is way too long). My question is more along the lines of 'Is an "fs fail" destructive?', that is, will the FS come up again after

- fs fail <fs_name>
...
- fs set <fs_name> joinable true

The alternative is just a power-off without regard for anything. Of course we will try to get as many FS clients unmounted as possible before that, but there is no time to wait for anything that takes long. I need a fast (unclean yet recoverable) procedure. Data in flight may get lost, but the FS itself must come up healthy again.
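
Spelled out with placeholders, the minimal sequence I have in mind is the following (a sketch, not a tested runbook):

    ceph fs fail <fs_name>
    # power-cycle the sub-cluster here
    ceph fs set <fs_name> joinable true
    ceph fs status <fs_name>   # expect an active rank 0 MDS again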

Any hints on how to do this? Also for the MON store log size problem?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 19 October 2022 13:27:11
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Temporary shutdown of subcluster and cephfs

Hi Frank,

fs fail isn't ideal -- there's an 'fs set <fs_name> down true' command for this.

Here's the procedure we used, most recently in the Nautilus days (spelled out as commands after the list):

1. If possible, unmount the fs from all clients, so that all dirty pages are flushed.
2. Prepare the ceph cluster: ceph osd set noout/noin
3. Wait until there is zero IO on the cluster, unmount any leftover clients.
4. ceph fs set cephfs down true
5. Stop all the ceph-osds.
6. Power off the cluster.
(At this point only the ceph-mons and ceph-mgrs were still running -- you can shut those down too.)
7. Power on the cluster, wait for the mon/mgr/osd/mds daemons to come up.
8. ceph fs set cephfs down false
9. Reconnect and test clients.
10. ceph osd unset noout/noin
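
In plain commands that is roughly (a sketch; the systemd target name assumes a package-based deployment, adjust to yours):

    ceph osd set noout
    ceph osd set noin
    # wait until client IO reaches zero; unmount leftover clients
    ceph fs set cephfs down true      # MDSs flush their journals and stop
    systemctl stop ceph-osd.target    # on each OSD host
    # power off; later power on and wait for mon/mgr/osd/mds
    ceph fs set cephfs down false
    ceph osd unset noout
    ceph osd unset noin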

-- Dan

On Wed, Oct 19, 2022 at 12:43 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> we need to prepare for temporary shutdowns of a part of our ceph cluster. I have 2 questions:
>
> 1) What is the recommended procedure to temporarily shut down a ceph fs quickly?
> 2) How do we avoid MON store overflow due to log spam (on Octopus 15.2.17)?
>
> To 1: Currently I'm thinking about the following (command sketch after the list):
>
> - fs fail <fs_name>
> - shut down all MDS daemons
> - shut down all OSDs in that sub-cluster
> - shut down MGRs and MONs in that sub-cluster
> - power servers down
> - mark out OSDs manually (the number will exceed the MON limit for auto-out)
>
> - power up
> - wait a bit
> - do I need to mark the OSDs in again, or will they join automatically after a manual out and restart (maybe just temporarily increase the MON limit at the end of the procedure above)?
> - fs set <fs_name> joinable true
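> 
> As commands, this draft would look roughly like the following (a sketch; the systemd target names are assumptions about our deployment):
> 
>     ceph fs fail <fs_name>
>     systemctl stop ceph-mds.target                   # on the sub-cluster MDS hosts
>     systemctl stop ceph-osd.target                   # on the sub-cluster OSD hosts
>     systemctl stop ceph-mon.target ceph-mgr.target   # on the sub-cluster MON/MGR hosts
>     # power down; mark the OSDs out via the remaining MONs: ceph osd out <osd-id> [<osd-id> ...]
>     # power up, wait a bit; possibly: ceph osd in <osd-id> ... (see the question above)
>     ceph fs set <fs_name> joinable true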
>
> Is this a safe procedure? The documentation calls this a procedure for "Taking the cluster down rapidly for deletion or disaster recovery", but neither of the two is our intent. We need a fast *reversible* procedure, because an "fs set down true" simply takes too long.
>
> There will be ceph fs clients remaining up. The desired behaviour is that client IO stalls until the fs comes back up and then just continues as if nothing had happened.
>
> To 2: We will have a sub-cluster down for an extended period of time. There have been cases where such a situation killed MONs due to an excessive amount of non-essential log messages accumulating in the MON store. Is this still a problem with 15.2.17, and what can I do to reduce it?
>
> Thanks for any hints/corrections/confirmations!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



