Re: what happens if a server crashes with cephfs?

Ceph clients keep updates buffered until they receive server
notification that the update is persisted to disk. On server crash,
the client connects to either the newly-responsible OSD (for file
data) or the standby/restarted MDS (for file metadata) and replays
outstanding operations. This is all transparent to applications and
filesystem users, except that I/O calls may take a very long time to
complete (if you don't have an MDS running for ten minutes, metadata
synchronization calls will simply hang for those ten minutes).
Ceph is designed to run operationally when failures happen, and
"failures" include upgrades. You may see degraded capacity, but the
filesystem remains online and no IO errors will be returned to
applications as a result.[1] The whole point of data replication and
Ceph's architecture is to prevent the sort of IO error you might get
when a Lustre target dies.

So, no: your applications will not receive IO errors because an OSD fails. :)
-Greg

[1]: In a default configuration. Some admins/environments prefer to
receive error codes rather than hangs on arbitrary syscalls, and there
are some limited accommodations for them which can be set with config
options.
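
To make the failover behavior above concrete, here is a minimal Python
sketch of the application's view of a buffered write (the mount path and
file contents are made up, and it assumes the default configuration
described above):

    import os

    # Some file on a CephFS mount (the path is hypothetical).
    fd = os.open("/mnt/cephfs/results.dat", os.O_WRONLY | os.O_CREAT, 0o644)

    # write() normally returns once the Ceph client has buffered the data;
    # the client keeps it until the OSDs acknowledge it as persisted.
    os.write(fd, b"intermediate results\n")

    # fsync() is the synchronization point: it returns only after the data is
    # safe on the OSDs. During an OSD or MDS failover this call blocks for the
    # duration of the failover rather than returning an error (by default).
    os.fsync(fd)
    os.close(fd)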

On Thu, Dec 8, 2022 at 9:52 AM Charles Hedrick <hedrick@xxxxxxxxxxx> wrote:
>
> I'm aware that the file system will remain available. My concern is about long jobs that use it failing because a single operation returns an error. While none of the discussion so far has been explicit, I assume this can happen if an OSD fails, since it might have asynchronously acknowledged an operation that it won't actually be able to complete. I'm assuming that it won't happen during a cephadm upgrade.
> ________________________________
> From: Manuel Holtgrewe <zyklenfrei@xxxxxxxxx>
> Sent: Thursday, December 8, 2022 12:38 PM
> To: Charles Hedrick <hedrick@xxxxxxxxxxx>
> Cc: Gregory Farnum <gfarnum@xxxxxxxxxx>; Dhairya Parmar <dparmar@xxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Re:  Re: what happens if a server crashes with cephfs?
>
> Hi Charles,
>
> Are you concerned with a single server of the Ceph cluster crashing, or the whole cluster going down? If you have sufficient redundancy, nothing bad should happen and the file system should remain available. The same should be true if you perform an upgrade in the "correct" way, e.g., through the cephadm commands.
>
> The folks over at 45Drives made a little show of tearing down a Ceph cluster bit by bit while it is running:
>
> https://www.youtube.com/watch?v=8paAkGx2_OA
>
> Cheers,
> Manuel
>
> On Thu, Dec 8, 2022 at 6:34 PM Charles Hedrick <hedrick@xxxxxxxxxxx> wrote:
>
> Network and local file systems have different requirements. If I have a long job and the machine I'm running on crashes, I have to rerun it. The fact that the last 500 msec of data didn't get flushed to disk is unlikely to matter.
>
> If I have a long job using a network file system and the server crashes, my job itself doesn't crash. You really want it to continue, without any errors, after the server reboots. It's true that you could return an error for write or close, and the job could detect that and either rewrite the file or exit. However, a very large amount of code is written for local files and doesn't check errors from write and close.
>
> I don't actually know how our long jobs would behave if a close fails. Perhaps it's OK. It's mostly Python. Presumably the Python interpreter would raise an I/O error.
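>
> For illustration, here is a minimal sketch of the kind of checking a
> careful job could do (plain Python; the path and helper name are made
> up). CPython does surface a failed flush or close as OSError, but only
> data that has been fsync'ed is guaranteed to be on stable storage:
>
>     import os
>
>     def checkpoint(path, payload):
>         """Write a checkpoint and return only once it is on stable storage."""
>         try:
>             with open(path, "wb") as f:
>                 f.write(payload)      # may only land in Python's buffer
>                 f.flush()             # hand it to the kernel / Ceph client
>                 os.fsync(f.fileno())  # wait until it is persisted; raises
>                                       # OSError if that ultimately fails
>         except OSError as exc:
>             # A long job can retry or rewrite the file here instead of
>             # silently continuing with data that never reached stable storage.
>             raise RuntimeError(f"checkpoint {path} was not durable") from exc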
>
> A related question: what is likely to happen when you do a version upgrade? Is that done in a way that won't generate errors in user code?
>
> ________________________________
> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> Sent: Thursday, December 8, 2022 11:44 AM
> To: Manuel Holtgrewe <zyklenfrei@xxxxxxxxx>
> Cc: Charles Hedrick <hedrick@xxxxxxxxxxx>; Dhairya Parmar <dparmar@xxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Re:  Re: what happens if a server crashes with cephfs?
>
> On Thu, Dec 8, 2022 at 8:42 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >
> > Hi Charles,
> >
> > As far as I know, CephFS implements POSIX semantics. That is, if the CephFS server cluster dies for whatever reason, then this will translate into I/O errors. This is the same as if your NFS server dies, or you run the program locally on a workstation/laptop and the machine loses power. POSIX file systems guarantee that data is persisted on the storage after a file is closed
>
> Actually the "commit on close" is *entirely* an NFS-ism and is not
> part of POSIX. If you expect a closed file to be flushed to disk
> anywhere else (including CephFS), you will be disappointed. You need
> to use fsync/fdatasync/sync/syncfs.
> -Greg
>
> > or fsync() is called. Otherwise, the data may still be "in flight", e.g., in the OS I/O cache or even the runtime library's cache.
> >
> > This is not a bug but a feature, as it improves performance: when appending small bits to a file, the HDD head does not have to move every time something is written, and on SSD a full 4 kB block does not have to be written for every small append.
> >
> > POSIX semantics even go further, enforcing certain guarantees when files are written from multiple clients. Recently, something called "lazy I/O" was introduced [1] in CephFS, which allows clients to explicitly relax some of these guarantees to improve performance.
> >
> > I don't think there even is a CephFS mount setting that lets you configure local caching the way NFS does. With NFS, I have seen setups where two clients saw two different versions of the same -- closed -- file, because one had written to the file and this was not yet reflected on the second client. To the best of my knowledge, this will not happen with CephFS.
> >
> > I'd be happy to learn to be wrong if I'm wrong. ;-)
> >
> > Best wishes,
> > Manuel
> >
> > [1] https://docs.ceph.com/en/latest/cephfs/lazyio/
> >
> > On Thu, Dec 8, 2022 at 5:09 PM Charles Hedrick <hedrick@xxxxxxxxxxx> wrote:
> >>
> >> Thanks. I'm evaluating CephFS for a computer science department. We have users who run week-long AI training jobs. They use standard packages, which they probably don't want to modify. At the moment we use NFS. It uses synchronous I/O, so if something goes wrong, the users' jobs pause until we reboot, and then continue. However, there's an obvious performance penalty for this.
> >> ________________________________
> >> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> >> Sent: Thursday, December 8, 2022 2:08 AM
> >> To: Dhairya Parmar <dparmar@xxxxxxxxxx>
> >> Cc: Charles Hedrick <hedrick@xxxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> >> Subject: Re:  Re: what happens if a server crashes with cephfs?
> >>
> >> More generally, as Manuel noted, you can (and should!) make use of fsync et al. for data safety. Ceph's async operations are no different at the application layer from the way data you send to a local hard drive can sit around in volatile caches until a consistency point like fsync is invoked.
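> >>
> >> As a rough illustration of that layering (the path is made up): data moves
> >> from the application's own buffer to the kernel / Ceph client cache, and is
> >> only guaranteed durable once a sync call returns.
> >>
> >>     import os
> >>
> >>     f = open("/mnt/cephfs/output.log", "a")
> >>     f.write("step 1000 done\n")  # sits in Python's userspace buffer
> >>     f.flush()                    # now in the kernel / Ceph client cache
> >>     os.fsync(f.fileno())         # acknowledged as persisted by the OSDs
> >>                                  # (os.fdatasync / os.sync are the data-only
> >>                                  # and whole-system relatives)
> >>     f.close()                    # close() by itself implies none of this
> >>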
> >> -Greg
> >>
> >> On Wed, Dec 7, 2022 at 10:02 PM Dhairya Parmar <dparmar@xxxxxxxxxx> wrote:
> >> Hi Charles,
> >>
> >> There are many scenarios in which a write/close operation can fail, but
> >> failures/errors are generally logged (normally every time) to help debug
> >> the case. So there are no silent failures as such, unless you hit a very
> >> rare bug.
> >> - Dhairya
> >>
> >>
> >> On Wed, Dec 7, 2022 at 11:38 PM Charles Hedrick <hedrick@xxxxxxxxxxx> wrote:
> >>
> >> > I believe asynchronous operations are used for some operations in CephFS.
> >> > That means the server acknowledges before data has been written to stable
> >> > storage. Does that mean there are failure scenarios where a write or close
> >> > will return an error, or fail silently?
> >> >
>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



