Re: bluefs transaction serialization atomicity.

On Thu, Aug 19, 2021 at 12:04 PM Sam Just <sjust@xxxxxxxxxx> wrote:
>
> I think that any outcome possible under normal operation (abnormal
> poweroff is normal) should not prevent the OSD from booting.
> -Sam

Yeah.

> On Wed, Aug 18, 2021 at 11:57 PM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I think that with our policy of writing the log, we have no way to distinguish a legitimate end-of-log
> > caused by abnormal poweroff from data corruption,
> > with the exception of being asked to read more data while having no more extents.

Surely we record a known-committed transaction ID somewhere? (In fact,
I'd assume we use such a value to know where to start replay from; if
not, we should.) If we are replaying the log and the transaction value
is lower than the known-committed one, we should error out, because the
disk has obviously lost data in some fashion. But otherwise, an error
overwhelmingly means the transaction didn't get fully committed, and
that means it was never written out of the log, so we should be good
to go.
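
Roughly, something like this (a hypothetical sketch with made-up names; I
haven't checked what the replay code actually records):

    #include <cerrno>
    #include <cstdint>

    // Hypothetical sketch, not actual BlueFS code.  known_committed_seq is
    // assumed to be persisted somewhere (e.g. next to the log's starting
    // offset); last_replayed_seq is the highest transaction seq successfully
    // replayed before decoding failed.
    // Returns 0 -> clean end-of-log, -EIO -> real corruption.
    int on_decode_failure(uint64_t last_replayed_seq,
                          uint64_t known_committed_seq) {
      if (last_replayed_seq < known_committed_seq) {
        // we hit garbage before reaching the point we know was committed:
        // the disk has lost data in some fashion, so refuse to mount
        return -EIO;
      }
      // everything known-committed was already replayed, so the bad
      // transaction was never acknowledged -- treat it as end-of-log
      return 0;
    }
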
-Greg


> >
> > My thinking would be that such an error should be fatal for OSD reboot,
> > but acceptable for `bluestore-tool fsck`.
> > Another approach would be to accept a crc failure if the size of the transaction was larger
> > than bdev_block_size, but treat it as a fatal error for anything smaller than that.
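> >
> > A rough sketch of that policy (hypothetical names, just to illustrate;
> > txc_size would be the encoded transaction length taken from the op header):
> >
> >     #include <cerrno>
> >     #include <cstdint>
> >
> >     // Hypothetical helper, not actual BlueFS code: decide whether a crc
> >     // failure hit during replay is acceptable.  block_size corresponds to
> >     // bdev_block_size.
> >     // Returns 0 -> treat as clean end-of-log, -EIO -> fatal corruption.
> >     int on_replay_crc_failure(uint64_t txc_size, uint64_t block_size) {
> >       if (txc_size > block_size) {
> >         // the transaction spanned multiple blocks; a torn write during an
> >         // abnormal poweroff can legitimately leave a bad crc
> >         return 0;
> >       }
> >       // a single-block transaction is written atomically by the device,
> >       // so a bad crc here means real data corruption
> >       return -EIO;
> >     }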
> >
> > In addition, I strongly believe that with the new OP_FILE_UPDATE_INC we will almost never have
> > more than 4K in a single log transaction.
> >
> > Best regards,
> > Adam
> >
> > On Mon, 16 Aug 2021 at 13:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
> >>
> >> Hi folks,
> >>
> >> the preface:
> >>
> >> This email has been inspired by the following ticket:
> >> https://tracker.ceph.com/issues/52079
> >>
> >> which reports two apparently similar bluefs corruptions caused by an
> >> unexpected power off.
> >>
> >> In both cases BlueFS finally refuses to replay the resulting log
> >> with pretty much the same pattern:
> >>
> >> - read the next 4K block (op header),
> >>
> >> - then do more block reads (an additional 8K, apparently due to the bluefs
> >> op/transaction being longer than 4K),
> >>
> >> - which finally causes faulty replay termination due to a transaction (not
> >> op header!) checksum failure (both times the 'expected checksum', i.e. the
> >> one read from disk in the last(!) block, is equal to 0).
> >>
> >> Here is the sample:
> >>
> >> "debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
> >> 0x559f918f7f80 0xb5000~1000 from file(ino 1 size 0xb5000 mtime 0.000000
> >> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
> >>
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
> >> 0x5b000 len 0x1000
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 4096
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _replay
> >> need 0x2000 more bytes
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
> >> 0x559f918f7f80 0xb6000~2000 from file(ino 1 size 0xb5000 mtime 0.000000
> >> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
> >> 0x5a000 len 0x2000
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 8192
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs _replay
> >> 0xb5000: stop: failed to decode: bad crc 1492738775 expected 0:
> >> Malformed input
> >>
> >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs mount failed
> >> to replay log: (5) Input/output error
> >>
> >> "
> >>
> >> The above makes me think an incomplete multi-block transaction write
> >> happened during the power off. This state should be fine to recover from
> >> IMO, but it looks like we don't have proper handling for this case in log
> >> replay. Shouldn't we treat a transaction decoding failure (for multi-block
> >> ones only?) as a good stop condition, rather than the unrecoverable error
> >> we have now:
> >>
> >>      bluefs_transaction_t t;
> >>      try {
> >>        auto p = bl.cbegin();
> >>        decode(t, p);
> >>      }
> >>      catch (ceph::buffer::error& e) {
> >>        derr << __func__ << " 0x" << std::hex << pos << std::dec
> >>             << ": stop: failed to decode: " << e.what()
> >>             << dendl;
> >>        delete log_reader;
> >>        return -EIO;
> >>      }
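> >>
> >> Something along these lines is what I have in mind (just a sketch; the
> >> got_more_blocks flag is made up and would be set when we had to read
> >> additional blocks beyond the op header):
> >>
> >>      bluefs_transaction_t t;
> >>      try {
> >>        auto p = bl.cbegin();
> >>        decode(t, p);
> >>      }
> >>      catch (ceph::buffer::error& e) {
> >>        if (got_more_blocks) {
> >>          // multi-block transaction: assume a torn write from the power
> >>          // off and treat this as a clean end of the log
> >>          dout(10) << __func__ << " 0x" << std::hex << pos << std::dec
> >>                   << ": stop: failed to decode multi-block transaction,"
> >>                   << " assuming end of log" << dendl;
> >>          break;
> >>        }
> >>        derr << __func__ << " 0x" << std::hex << pos << std::dec
> >>             << ": stop: failed to decode: " << e.what()
> >>             << dendl;
> >>        delete log_reader;
> >>        return -EIO;
> >>      }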
> >>
> >> What do you think? Is the above analysis valid?
> >>
> >> Any other thoughts?
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >>

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


