Re: Should OSD write error result in damaged filesystem?


 



On Sun, Nov 4, 2018 at 10:24 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>
> >OSD write errors are not usual events: any issues with the underlying
> >storage are expected to be handled by RADOS, and write operations to
> >an unhealthy cluster should block, rather than returning an error.  It
> >would not be correct for CephFS to throw away metadata updates in the
> >case of unexpected write errors -- this is a strongly consistent
> >system, so when we can't make progress consistently (i.e. respecting
> >all the ops we've seen in order), then we have to stop.
>
> Thank you for that explanation; that all makes sense.  I have to get used to
> the idea of responding to broken storage by waiting indefinitely until it
> isn't broken.  I wasn't thinking in those terms.
>
> >I'm guessing that you changed some related settings (like
> >mds_log_segment_size) to get into this situation?  Otherwise, an error
> >like this would definitely be a bug.
>
> What I changed (from default) was osd_max_write_size.  I set it to its legal
> minimum, 1M.  I've discovered that there are clients all around that expect to
> be able to write 4M and don't respond nicely when they can't.  Rather than try
> to find and change them all, I'm going to capitulate and go ahead and make
> osd_max_write_size 4M.
>
> Does manually tuning every client to make it consistent with the OSD's maximum
> write size have to be what avoids crashes like this?  It sure would be nice if
> an MDS could detect much earlier that the log is on an OSD that's incapable of
> hosting that log.  But I found the filesystem driver is the same way - I have
> to tell it how big a write it can do; it can't figure it out from the OSDs.
> So maybe it's a fundamental architecture thing.

Settings like osd_max_write_size are meant as sanity checks to reject
"unreasonably large" operations, rather than as policy to control the
actual size of operations.  That's not made particularly clear in the
documentation though, and arguably we should not let people set this
value below 4M (no OSD backend should have a problem with 4M writes)
-- https://github.com/ceph/ceph/pull/24929
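For anyone in the same situation wanting to put the value back: the option is expressed in megabytes, so "4M" corresponds to a value of 4.  A sketch of checking and restoring it (this assumes Mimic or later for the centralized `ceph config` commands; on older releases you'd edit ceph.conf or use `injectargs` instead):

```shell
# Show the monitor-side value of the setting (value is in MB).
ceph config get osd osd_max_write_size

# Raise it back to the 4M that clients commonly assume.
ceph config set osd osd_max_write_size 4

# Check what a particular running OSD actually sees, via its admin socket.
ceph daemon osd.0 config get osd_max_write_size
```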

Having the MDS automatically notice if the object size in a layout
exceeds the OSD's max would be a useful check, but it's awkward to
implement because of the way Ceph's config system works -- the daemons
can have different ideas of what a particular setting's value is, so
the MDS can't necessarily see what the real OSD-side value of a
setting is.
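To illustrate the divergence problem: each daemon's admin socket reports the value that daemon is actually running with, which need not match what another daemon was started with (for example, if one host has a local ceph.conf override).  A hypothetical comparison (the daemon names `mds.a` and `osd.0` are placeholders for your own) might look like:

```shell
# Each daemon reports its *own* local value; these can legitimately
# differ, which is why the MDS can't assume it knows the OSD's limit.
ceph daemon mds.a config get osd_max_write_size
ceph daemon osd.0 config get osd_max_write_size
```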

John

>
> --
> Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




