Re: Temporary shutdown of subcluster and cephfs

Hi Patrick,

thanks for your answer. This is exactly the behaviour we need.

For future reference some more background:

We need to prepare a fairly large installation for planned power outages. Even though they are called planned, we will not be able to handle them manually in good time, for reasons irrelevant here. Our installation is protected by a UPS, but the guaranteed uptime on outage is only 6 minutes. So we are talking more about transient protection than uninterrupted power supply. Although we have survived power outages of more than 20 minutes without loss of power to the DC, we need to plan with these 6 minutes.

Of these 6 minutes, we need to wait at least 1-2 minutes to avoid unintended shut-downs. In the remaining 4 minutes, we need to take down a 500-node HPC cluster and a 1000-OSD + 12-MDS + 2-MON ceph sub-cluster. Part of this ceph cluster will continue running at another site with higher power redundancy. That leaves maybe 1-2 minutes of response time for the ceph cluster, and the best we can do is try to reach a "consistent at rest" state and hope we can cleanly power down the system before the power is cut.

Why am I so concerned about a "consistent at rest" state?

It's because while not every power loss leads to data loss, every instance of data loss I know of that was not caused by admin error was caused by a power loss (see https://tracker.ceph.com/issues/46847). We were asked to prepare for a worst case of weekly power cuts, so there is no room for taking chances here. Our approach is: unmount as much as possible, quickly fail the FS to stop all remaining IO, give OSDs and MDSes a chance to flush pending operations to disk or journal, and then attempt a clean shut down.
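In concrete terms, something roughly like the sketch below (illustrative only; the exact timing still needs testing on our side, and <fs_name> and the mount point are placeholders):

    # on the HPC/client nodes: unmount as many cephfs mounts as possible
    umount -l /mnt/cephfs

    # stop all remaining FS IO quickly (not destructive, see below)
    ceph fs fail <fs_name>

    # give MDSes and OSDs a moment to flush outstanding work to journal/disk
    sleep 30

    # then stop the ceph daemons on the hosts that will lose power
    systemctl stop ceph.target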

I will also have to temporarily adjust a number of parameters to ensure that the remaining sub-cluster continues to operate as normally as possible, for example, that it handles OSD failures in the usual way despite 90% of the OSDs already being down.
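As an example of what I mean (the value is a placeholder, not a recommendation): with ~90% of the OSDs down, the mons will by default stop marking further OSDs out once the in-ratio falls below mon_osd_min_in_ratio, so the surviving site would need something along the lines of:

    # allow the mons to keep marking failed OSDs out while most OSDs are down
    ceph config set mon mon_osd_min_in_ratio 0.05

    # revert once the powered-down part of the cluster is back
    ceph config set mon mon_osd_min_in_ratio 0.75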

Thanks for your input and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 24 October 2022 20:01:01
To: Frank Schilder
Cc: Dan van der Ster; ceph-users@xxxxxxx
Subject: Re:  Re: Temporary shutdown of subcluster and cephfs

On Wed, Oct 19, 2022 at 7:54 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> I know that "fs fail ..." is not ideal, but we will not have time for a clean "fs down true" plus wait-for-journal-flush procedure to complete (on our cluster this takes at least 20 minutes, which is way too long). My question is more along the lines of: is an "fs fail" destructive?

It is not, but lingering clients will not be evicted automatically by
the MDS. If you can, unmount before doing `fs fail`.
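For example (a sketch only; <name> and <client_id> are placeholders), you can check for and evict such lingering sessions by hand:

    # list client sessions still attached to an MDS
    ceph tell mds.<name> client ls

    # evict a remaining client by its session id
    ceph tell mds.<name> client evict id=<client_id>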

A journal flush is not really necessary. You only need to wait ~10
seconds after the last client unmounts to give the MDS time to write
any outstanding events to its journal.

> , that is, will an FS come up again after
>
> - fs fail
> ...
> - fs set <fs_name> joinable true

Yes.
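
(For reference, a sketch of the corresponding bring-up after power returns; <fs_name> is a placeholder:)

    # start the ceph daemons on the sub-cluster hosts again
    systemctl start ceph.target

    # allow MDS daemons to join and bring the file system back
    ceph fs set <fs_name> joinable true

    # check that the ranks come back up
    ceph fs status <fs_name>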

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



