I think that any outcome possible under normal operation (abnormal
poweroff is normal) should not prevent the OSD from booting.
-Sam

On Wed, Aug 18, 2021 at 11:57 PM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
>
> Hi,
>
> I think that with our policy of writing the log we have no way to
> distinguish a legit end-of-log caused by an abnormal poweroff from one
> caused by data corruption, with the exception of being asked to read
> more data but having no more extents.
>
> My thinking would be that such an error should be fatal for OSD reboot,
> but acceptable for `bluestore-tool fsck`.
> Another approach would be to accept a crc failure if the size of the
> transaction was larger than bdev_block_size, but for anything smaller
> than that treat it as a fatal error.
>
> In addition, I strongly believe that with the new OP_FILE_UPDATE_INC we
> will almost never have more than 4K in a single log transaction.
>
> Best regards,
> Adam
>
> On Mon, 16 Aug 2021 at 13:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>
>> Hi folks,
>>
>> The preface:
>>
>> This email has been inspired by the following ticket:
>> https://tracker.ceph.com/issues/52079
>>
>> which reports two apparently similar BlueFS corruptions caused by an
>> unexpected power off.
>>
>> In both cases BlueFS ultimately refuses to replay the resulting log,
>> with pretty much the same pattern:
>>
>> - read the next 4K block (op header),
>>
>> - then do more block reads (an additional 8K, apparently because the
>> bluefs op/transaction is longer than 4K),
>>
>> - which finally causes replay to terminate due to a transaction (not
>> op header!) checksum failure (both times the 'expected' checksum, i.e.
>> the one read from disk in the last(!) block, is equal to 0).
>>
>> Here is the sample:
>>
>> "debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h 0x559f918f7f80 0xb5000~1000 from file(ino 1 size 0xb5000 mtime 0.000000 allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left 0x5b000 len 0x1000
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 4096
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _replay need 0x2000 more bytes
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h 0x559f918f7f80 0xb6000~2000 from file(ino 1 size 0xb5000 mtime 0.000000 allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left 0x5a000 len 0x2000
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 8192
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs _replay 0xb5000: stop: failed to decode: bad crc 1492738775 expected 0: Malformed input
>>
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs mount failed to replay log: (5) Input/output error
>> "
>>
>> The above makes me think that an incomplete multi-block transaction
>> write happened during the power off. This state should be fine to
>> recover from, IMO, but it looks like we don't have proper handling for
>> this case in log replay.
>> Shouldn't we treat a transaction decoding failure (for a multi-block
>> one only?) as a good stop condition, rather than the unrecoverable
>> error we have now:
>>
>>     bluefs_transaction_t t;
>>     try {
>>       auto p = bl.cbegin();
>>       decode(t, p);
>>     }
>>     catch (ceph::buffer::error& e) {
>>       derr << __func__ << " 0x" << std::hex << pos << std::dec
>>            << ": stop: failed to decode: " << e.what()
>>            << dendl;
>>       delete log_reader;
>>       return -EIO;
>>     }
>>
>> What do you think? Is the above analysis valid?
>>
>> Any other thoughts?
>>
>>
>> Thanks,
>>
>> Igor
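To make the policy being discussed concrete, here is a minimal,
self-contained sketch of the decision rule that combines Igor's "good
stop condition" idea with Adam's size-based criterion: a decode/CRC
failure is accepted as a clean end-of-log only when the transaction
spans more than one device block (and so could have been torn by a
power loss mid-write), and stays fatal otherwise. All names here
(classify_decode_failure, ReplayStop, the txn_len/bdev_block_size
parameters) are illustrative, not the actual BlueFS::_replay() code.

// Illustrative sketch only -- not the real BlueFS replay code.
#include <cassert>
#include <cstdint>
#include <iostream>

enum class ReplayStop {
  clean_end_of_log,   // stop replay here; mount can proceed
  fatal_corruption    // abort replay with EIO
};

// Decide how to treat a transaction whose decode/CRC check failed.
ReplayStop classify_decode_failure(uint64_t txn_len, uint64_t bdev_block_size)
{
  // A transaction that fits in a single block is written atomically by the
  // device, so a bad checksum there cannot be explained by a torn write.
  if (txn_len > bdev_block_size) {
    return ReplayStop::clean_end_of_log;
  }
  return ReplayStop::fatal_corruption;
}

int main()
{
  const uint64_t block = 4096;  // bdev_block_size in the discussion

  // The case from the ticket: a 0x3000 (3-block) transaction whose tail
  // blocks were never written before the power cut -> clean stop.
  assert(classify_decode_failure(0x3000, block) == ReplayStop::clean_end_of_log);

  // A sub-block transaction with a bad CRC cannot be a torn write -> fatal.
  assert(classify_decode_failure(0x800, block) == ReplayStop::fatal_corruption);

  std::cout << "policy sketch behaves as described" << std::endl;
  return 0;
}

Applied to the catch block quoted above, this would roughly amount to
breaking out of the replay loop (treating it like the other "stop:"
conditions) instead of unconditionally returning -EIO, at least when
the failed transaction spanned more than one block.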