Re: Temporary shutdown of subcluster and cephfs

On Tue, Oct 25, 2022 at 3:48 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Patrick,
>
> thanks for your answer. This is exactly the behaviour we need.
>
> For future reference some more background:
>
> We need to prepare a quite large installation for planned power outages. Even though they are called planned, we will not be able to handle them manually in good time, for reasons irrelevant here. Our installation is protected by a UPS, but the guaranteed uptime on an outage is only 6 minutes. So we are talking more about transient protection than an uninterruptible power supply. Although we have survived power outages of more than 20 minutes without loss of power to the DC, we need to plan with these 6 minutes.
>
> In these 6 minutes, we need to wait for at least 1-2 minutes to avoid unintended shutdowns. In the remaining 4 minutes, we need to take down a 500-node HPC cluster and a 1000-OSD + 12-MDS + 2-MON Ceph sub-cluster. Part of this Ceph cluster will continue running at another site with higher power redundancy. This leaves maybe 1-2 minutes of response time for the Ceph cluster, and the best we can do is try to reach a "consistent at rest" state and hope we can cleanly power down the system before the power is cut.
>
> Why am I so concerned about a "consistent at rest" state?
>
> It's because while not every power loss leads to data loss, every instance of data loss I know of that was not caused by an admin error was caused by a power loss (see https://tracker.ceph.com/issues/46847). We were asked to prepare for a worst case of weekly power cuts, so there is no room for taking chances here. Our approach is: unmount as much as possible, quickly fail the FS to stop all remaining IO, give OSDs and MDSes a chance to flush pending operations to disk or journal, and then attempt a clean shutdown.

To be clear, in case there is any confusion: once you do `fs fail`, the
MDSs are removed from the cluster and they will respawn. They are not
given any time to flush remaining I/O.
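
For illustration only, a minimal sketch of such a "consistent at rest"
sequence, assuming a file system named `cephfs` and a client mount at
/mnt/cephfs (both placeholder names; adapt commands and targets to your
deployment):

  # on the clients: stop I/O and unmount as much as possible
  umount /mnt/cephfs

  # quiesce the cluster for the outage window
  ceph osd set noout         # don't mark stopped OSDs out
  ceph osd set norecover     # no recovery traffic
  ceph osd set norebalance   # no rebalancing
  ceph osd set nobackfill    # no backfill
  ceph osd set pause         # stop all client I/O cluster-wide

  # take the file system down; note that `fs fail` removes the MDS
  # ranks immediately, whereas `ceph fs set cephfs down true` lets
  # the ranks flush their journals before stopping
  ceph fs fail cephfs

  # finally, stop the Ceph daemons cleanly on each host
  systemctl stop ceph.target

Remember to clear the flags on the way back up (`ceph osd unset ...`,
and `ceph fs set cephfs joinable true` after `fs fail`, or
`ceph fs set cephfs down false` if you used the `down` flag).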

FYI as this may interest you: we have a ticket to set a flag on the
file system to prevent new client mounts:
https://tracker.ceph.com/issues/57090
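
That is still an open ticket, so the exact interface is an assumption,
but if a flag along the lines proposed there (e.g. `refuse_client_session`)
ships in a release, the usage would presumably be a single file system
setting:

  # hypothetical until https://tracker.ceph.com/issues/57090 is resolved
  # in the release you run: refuse new client sessions while quiesced
  ceph fs set cephfs refuse_client_session true
  # ... and allow them again after power is restored
  ceph fs set cephfs refuse_client_session false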

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


