Re: Temporary shutdown of subcluster and cephfs

Hi Patrick.

> To be clear in case there is any confusion: once you do `fs fail`, the
> MDS are removed from the cluster and they will respawn. They are not
> given any time to flush remaining I/O.

This is fine; there is not enough time to flush anything anyway. As long as they leave the metadata and data pools in a consistent state, that is, after an "fs set <fs_name> joinable true" the MDSes start replaying the journal etc. and the FS comes up healthy, everything is fine. If user IO in flight gets lost in this process, that is not a problem. A problem would be a corruption of the file system itself.
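
For reference, the sequence I have in mind is roughly the following (just a sketch from memory, with <fs_name> standing in for our file system name):

    # before the power cut: stop all client IO hard
    ceph fs fail <fs_name>

    # after power-up: let the MDSes rejoin and replay the journal
    ceph fs set <fs_name> joinable true

    # wait until all ranks are active again and the cluster is healthy
    ceph fs status <fs_name>
    ceph -s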

In my experience, an "mds fail" is a clean (non-destructive) operation; I have never had an FS corruption due to an mds fail. As long as an "fs fail" is also non-destructive, it is the best way I can see to cut off all user IO as fast as possible and bring all hardware to rest. What I would like to avoid is a power loss on a busy cluster, where I would have to rely on too many things being implemented correctly. With >800 disks you start seeing unusual firmware failures, and disk failures after power-up are not uncommon either. I just want to take as much as possible out of the "does this really work in all corner cases" equation and rather rely on "I did this 100 times in the past without a problem" situations.
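
On the OSD side, I would probably also set the usual flags to keep the cluster at rest while it is down, something along these lines (standard procedure, not yet rehearsed on this particular cluster):

    # after failing the FS, before powering down
    ceph osd set noout
    ceph osd set norebalance
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set pause        # stop remaining client IO to the pools

    # after power-up, once all OSDs are back
    ceph osd unset pause
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset norebalance
    ceph osd unset noout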

That users may have to repeat a task is not a problem. Damaging the file system itself, on the other hand, is.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 25 October 2022 14:51:33
To: Frank Schilder
Cc: Dan van der Ster; ceph-users@xxxxxxx
Subject: Re:  Re: Temporary shutdown of subcluster and cephfs

On Tue, Oct 25, 2022 at 3:48 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Patrick,
>
> thanks for your answer. This is exactly the behaviour we need.
>
> For future reference some more background:
>
> We need to prepare a quite large installation for planned power outages. Even though they are called planned, we will not be able to handle these manually in good time, for reasons irrelevant here. Our installation is protected by a UPS, but the guaranteed uptime on an outage is only 6 minutes. So this is more about transient protection than an uninterruptible power supply. Although we have survived power outages of more than 20 minutes without loss of power to the DC, we need to plan with these 6 minutes.
>
> In these 6 minutes, we need to wait for at least 1-2 minutes to avoid unintended shut-downs. In the remaining 4 minutes, we need to take down a 500-node HPC cluster and a 1000-OSD+12-MDS+2-MON Ceph sub-cluster. Part of this Ceph cluster will continue running on another site with higher power redundancy. This gives maybe 1-2 minutes of response time for the Ceph cluster, and the best we can do is try to reach a "consistent at rest" state and hope we can cleanly power down the system before the power is cut.
>
> Why am I so concerned about a "consistent at rest" state?
>
> It's because while not all instances of a power loss lead to data loss, all instances of data loss I know of that were not caused by admin errors were caused by a power loss (see https://tracker.ceph.com/issues/46847). We were asked to prepare for a worst case of weekly power cuts, so there is no room for taking too many chances here. Our approach is: unmount as much as possible, quickly fail the FS to stop all remaining IO, give OSDs and MDSes a chance to flush pending operations to disk or journal, and then try a clean shut down.

To be clear in case there is any confusion: once you do `fs fail`, the
MDS are removed from the cluster and they will respawn. They are not
given any time to flush remaining I/O.

FYI as this may interest you: we have a ticket to set a flag on the
file system to prevent new client mounts:
https://tracker.ceph.com/issues/57090

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



