> But the problem is that we lack the ability to distinguish between data corruption (abnormal)
> and poweroff (normal).
> It will be bad if we encounter a data error during replay of the log, assume that we only lost the last transaction,
> ignore it because it was not stable anyway and continue, while in reality there were many more transactions
> thought to be stable on disk.

This is a pretty basic element of this commit protocol. We are relying
on the disk semantics to ensure that if a set of writes followed by a
barrier completes, those writes must be readable later on. The case
where there are incomplete transactions at the end of the log is
expected in the event of a powercycle. IMO, it's not reasonable to
expect an operator to interpret or correct such behavior since, again,
we expect it. Greg's suggestion provides a way to dramatically narrow
the undetectable corruption window, but such a window is, I think, a
necessary component of this commit protocol.

If we have a reasonably quick fsck we can run automatically, I think
that would be reasonable, but I don't think it's reasonable to expose
this to an operator.
-Sam

On Fri, Aug 20, 2021 at 2:54 AM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
>
> > I think that any outcome possible under normal operation (abnormal
> > poweroff is normal) should not prevent the OSD from booting.
> > -Sam
>
> That is correct.
> But the problem is that we lack the ability to distinguish between data corruption (abnormal)
> and poweroff (normal).
> It will be bad if we encounter a data error during replay of the log, assume that we only lost the last transaction,
> ignore it because it was not stable anyway and continue, while in reality there were many more transactions
> thought to be stable on disk.
>
> Adam
>
> On Thu, 19 Aug 2021 at 22:07, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Thu, Aug 19, 2021 at 12:04 PM Sam Just <sjust@xxxxxxxxxx> wrote:
>> >
>> > I think that any outcome possible under normal operation (abnormal
>> > poweroff is normal) should not prevent the OSD from booting.
>> > -Sam
>>
>> Yeah.
>>
>> > On Wed, Aug 18, 2021 at 11:57 PM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I think that, with our policy of writing the log, we have no way to distinguish between a legit end-of-log
>> > > caused by an abnormal poweroff and one caused by data corruption,
>> > > with the exception of being requested to read more data but having no more extents.
>>
>> Surely we record a known-committed transaction ID somewhere? (In fact
>> I'd assume we use such a value to know where to start replay from? If
>> not:) If we are replaying the log and the transaction value is lower
>> than the known-committed one, we should error out because obviously the
>> disk has lost data in some fashion. But otherwise, an error
>> overwhelmingly means the transaction didn't get fully committed, and
>> that means it was never written out of the log, so we should be good
>> to go.
>> -Greg
>>
>>
>> > >
>> > > My thinking would be that such an error should be fatal for an OSD reboot,
>> > > but acceptable for `bluestore-tool fsck`.
>> > > Another approach would be to accept a crc failure if the size of the transaction was larger
>> > > than bdev_block_size, but to treat it as a fatal error for anything smaller than that.
>> > >
>> > > In addition, I strongly believe that with the new OP_FILE_UPDATE_INC we will almost never have
>> > > more than 4K in a single log transaction.
>> > >
>> > > Best regards,
>> > > Adam
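A minimal sketch of the check Greg describes above, assuming BlueFS keeps (or could keep) a persisted record of the highest transaction sequence number known to be committed; the names used here (classify_replay_stop, known_committed_seq, and so on) are illustrative, not the actual BlueFS API:

  #include <cstdint>

  enum class ReplayStop { clean_end, corruption };

  // last_replayed_seq:   seq of the last transaction that decoded and
  //                      replayed successfully before the failure.
  // known_committed_seq: highest seq recorded as durable (e.g. in the
  //                      superblock) before the shutdown.
  ReplayStop classify_replay_stop(uint64_t last_replayed_seq,
                                  uint64_t known_committed_seq)
  {
    if (last_replayed_seq < known_committed_seq) {
      // The log ends before a transaction we had already acknowledged
      // as stable: the disk has lost data, so replay must fail.
      return ReplayStop::corruption;
    }
    // The failing transaction was never acknowledged as stable; a torn
    // tail after a poweroff is expected, so stop replay cleanly here.
    return ReplayStop::clean_end;
  }

Anything newer than known_committed_seq is then the unavoidable window Sam refers to: writes that may or may not have reached the disk but were never acknowledged as stable.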
>> > >
>> > > On Mon, 16 Aug 2021 at 13:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>> > >>
>> > >> Hi folks,
>> > >>
>> > >> the preface:
>> > >>
>> > >> This email has been inspired by the following ticket:
>> > >> https://tracker.ceph.com/issues/52079
>> > >>
>> > >> which reports two apparently similar bluefs corruptions caused by
>> > >> unexpected power off.
>> > >>
>> > >> In both cases BlueFS finally refuses to replay the resulting log
>> > >> with pretty much the same pattern:
>> > >>
>> > >> - read the next 4K block (op header),
>> > >>
>> > >> - then do more block reads (an additional 8K, apparently due to the bluefs
>> > >> op/transaction being longer than 4K),
>> > >>
>> > >> - which finally causes faulty replay termination due to a transaction (not
>> > >> op header!) checksum failure (both times the 'expected checksum', i.e. the
>> > >> one read from disk in the last(!) block, is equal to 0).
>> > >>
>> > >> Here is the sample:
>> > >>
>> > >> "debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
>> > >> 0x559f918f7f80 0xb5000~1000 from file(ino 1 size 0xb5000 mtime 0.000000
>> > >> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> > >>
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
>> > >> 0x5b000 len 0x1000
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 4096
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _replay
>> > >> need 0x2000 more bytes
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
>> > >> 0x559f918f7f80 0xb6000~2000 from file(ino 1 size 0xb5000 mtime 0.000000
>> > >> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
>> > >> 0x5a000 len 0x2000
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 8192
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs _replay
>> > >> 0xb5000: stop: failed to decode: bad crc 1492738775 expected 0:
>> > >> Malformed input
>> > >>
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs mount failed
>> > >> to replay log: (5) Input/output error
>> > >>
>> > >> "
>> > >>
>> > >> The above makes me think that an incomplete multi-block transaction
>> > >> write happened during the power off. This state should be fine to recover
>> > >> from IMO, but it looks like we don't have proper handling in log replay for
>> > >> this case. Shouldn't we treat a transaction decoding failure (multi-block
>> > >> ones only?) as a good stop condition, not as the unrecoverable one we have
>> > >> for now:
>> > >>
>> > >> bluefs_transaction_t t;
>> > >> try {
>> > >>   auto p = bl.cbegin();
>> > >>   decode(t, p);
>> > >> }
>> > >> catch (ceph::buffer::error& e) {
>> > >>   derr << __func__ << " 0x" << std::hex << pos << std::dec
>> > >>        << ": stop: failed to decode: " << e.what()
>> > >>        << dendl;
>> > >>   delete log_reader;
>> > >>   return -EIO;
>> > >> }
>> > >>
>> > >> What do you think? Is the above analysis valid?
>> > >>
>> > >> Any other thoughts?
>> > >>
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Igor
>> > >>
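For illustration only, the catch block quoted above could distinguish the two cases along the lines Igor and Adam suggest (treat a decode failure as a clean end of log only for a multi-block transaction). This is a sketch, not a tested patch: it assumes the surrounding replay loop, and that a variable such as 'more' holds the number of bytes that had to be read beyond the first 4K block, as in the "need 0x2000 more bytes" line in the log above; the real code may need a different condition or extra cleanup.

  bluefs_transaction_t t;
  try {
    auto p = bl.cbegin();
    decode(t, p);
  }
  catch (ceph::buffer::error& e) {
    if (more > 0) {
      // Multi-block transaction: a torn write at the tail of the log
      // after a poweroff can produce exactly this decode/crc failure,
      // so treat it as the end of the log rather than as corruption.
      dout(10) << __func__ << " 0x" << std::hex << pos << std::dec
               << ": stop: failed to decode multi-block transaction,"
               << " assuming clean end of log" << dendl;
      break;  // leave the replay loop; sketch only, real cleanup may differ
    }
    // A transaction that fits in a single bdev block is written
    // atomically, so a decode failure there cannot be a torn write:
    // keep it fatal.
    derr << __func__ << " 0x" << std::hex << pos << std::dec
         << ": stop: failed to decode: " << e.what()
         << dendl;
    delete log_reader;
    return -EIO;
  }

Combined with a known-committed sequence check like the one sketched earlier in the thread, this would only ever skip transactions that were never acknowledged as stable.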
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx