On Sat, Nov 3, 2018 at 7:28 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>
> I had a filesystem rank get damaged when the MDS had an error writing the log
> to the OSD.  Is damage expected when a log write fails?
>
> According to log messages, an OSD write failed because the MDS attempted
> to write a bigger chunk than the OSD's maximum write size.  I can probably
> figure out why that happened and fix it, but OSD write failures can happen for
> lots of reasons, and I would have expected the MDS just to discard the recent
> filesystem updates, issue a log message, and keep going.  The user had
> presumably not been told those updates were committed.

The MDS will go into a damaged state when it sees an unexpected error from an OSD (the key word there is "unexpected"; this does not apply to ordinary behaviour such as an OSD going down).  In this case, it doesn't mean that the metadata is literally damaged, just that the MDS has encountered a situation it can't handle, and needs to be stopped until a human being can intervene to sort the situation out.

OSD write errors are not usual events: any issues with the underlying storage are expected to be handled by RADOS, and write operations to an unhealthy cluster should block rather than return an error.  It would not be correct for CephFS to throw away metadata updates in the case of unexpected write errors -- this is a strongly consistent system, so when we can't make progress consistently (i.e. respecting all the ops we've seen, in order), we have to stop.

Assuming that the only problem was indeed that the MDS's journaler was attempting to exceed the OSD's maximum write size, then you should find that doing a "ceph mds repaired..." to clear the damaged flag will allow the MDS to start again.

I'm guessing that you changed some related settings (like mds_log_segment_size) to get into this situation?  Otherwise, an error like this would definitely be a bug.

John

>
> And how do I repair this now?  Is this a job for
>
>    cephfs-journal-tool event recover_dentries
>    cephfs-journal-tool journal reset
>
> ?
>
> This is Jewel.
>
> --
> Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
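
For anyone hitting the same situation later, here is a minimal sketch of the checks and the recovery step John describes, assuming the only problem really was an oversized journal write.  The daemon ids (mds.a, osd.0), the damaged rank 0, and the use of osd_max_write_size as the relevant OSD write limit are assumptions; substitute the names from your own cluster.

    # Compare the MDS journal segment size against the OSD write limit
    # (osd_max_write_size is assumed to be the limit that was exceeded)
    ceph daemon mds.a config get mds_log_segment_size
    ceph daemon osd.0 config get osd_max_write_size

    # Confirm which rank is marked damaged
    ceph health detail
    ceph status

    # Clear the damaged flag so the MDS for that rank can start again
    # (Jewel-era syntax takes the rank; newer releases also accept <fs_name>:<rank>)
    ceph mds repaired 0

If the journal itself turns out to be unreadable rather than just blocked on an oversized write, that is the point at which the cephfs-journal-tool recover_dentries / journal reset steps quoted above would come into play, not before.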