On Mon, Apr 26, 2021 at 8:09 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Mon, Apr 26, 2021 at 4:04 AM <hase.jin@xxxxxxxxxxx> wrote:
> >
> > Hi Sage,
> >
> > > If the medium error you got led to an incomplete readdir()
> > > result from XFS, then Ceph doesn't try to cope with that.
> >
> > Do you mean this behavior is in Ceph specifications?
> > It is a problem that data loss actually occurs, so I think we need to solve that.
>
> I would frame it like this:
>
> - With FileStore, Ceph assumed that XFS would return results we could
>   trust (i.e., it would not silently skip files). Trusting XFS turned
>   out to be a bad idea, and not just because of readdir--we also
>   couldn't trust that any data returned by XFS was correct since XFS
>   does not do any sort of data checksums.
> - We replaced FileStore with BlueStore, which checksums both metadata
>   and data, solving this entire class of problems.
>
> The "fix" in this case is to replace your FileStore OSDs with
> BlueStore. This particular backfill corner case is just one of many
> bad things that can happen with FileStore and media errors.

I agree, bluestore handles such errors in a much better way than
filestore and we have further improvements in the pipeline like
https://trello.com/c/pWbCyYsz/614-bluestore-make-asserts-unique-per-return-value,
which will help distinguish issues with the underlying layer more
easily.

Neha

>
> sage
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
>