> But the problem is that we lack the ability to distinguish between data corruption (abnormal)
> and poweroff (normal).
> It will be bad if we encounter a data error during replay of the log, assume that we only lost the last transaction,
> ignore it because it was not stable anyway and continue, while in reality there were many more transactions
> thought to be stable on disk.

This is a pretty basic element of this commit protocol. We are relying
on the disk semantics to ensure that if a set of writes followed by a
barrier completes, those writes must be readable later on. The case
where there are incomplete transactions at the end of the log is
expected in the event of a powercycle. IMO, it's not reasonable to
expect an operator to interpret or correct such behavior since, again,
we expect it. Greg's suggestion provides a way to dramatically narrow
the undetectable corruption window, but such a window is, I think, a
necessary component of this commit protocol.

If we have a reasonably quick fsck we can run automatically, I think
that would be reasonable, but I don't think it's reasonable to expose
this to an operator.
-Sam

On Fri, Aug 20, 2021 at 2:54 AM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
>
> > I think that any outcome possible under normal operation (abnormal
> > poweroff is normal) should not prevent the OSD from booting.
> > -Sam
>
> That is correct.
> But the problem is that we lack the ability to distinguish between data corruption (abnormal)
> and poweroff (normal).
> It will be bad if we encounter a data error during replay of the log, assume that we only lost the last transaction,
> ignore it because it was not stable anyway and continue, while in reality there were many more transactions
> thought to be stable on disk.
>
> Adam
>
> On Thu, 19 Aug 2021 at 22:07, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Thu, Aug 19, 2021 at 12:04 PM Sam Just <sjust@xxxxxxxxxx> wrote:
>> >
>> > I think that any outcome possible under normal operation (abnormal
>> > poweroff is normal) should not prevent the OSD from booting.
>> > -Sam
>>
>> Yeah.
>>
>> > On Wed, Aug 18, 2021 at 11:57 PM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I think that, with our policy of writing the log, we have no way to distinguish between a legit end-of-log
>> > > caused by an abnormal poweroff and one caused by data corruption,
>> > > with the exception of being requested to read more data but having no more extents.
>>
>> Surely we record a known-committed transaction ID somewhere? (In fact
>> I'd assume we use such a value to know where to start replay from? If
>> not:) If we are replaying the log and the transaction value is lower
>> than the known-committed one, we should error out because obviously the
>> disk has lost data in some fashion. But otherwise, an error
>> overwhelmingly means the transaction didn't get fully committed, and
>> that means it was never written out of the log, so we should be good
>> to go.
>> -Greg
>>
>>
>> > >
>> > > My thinking would be that such an error should be fatal for an OSD reboot,
>> > > but acceptable for `bluestore-tool fsck`.
>> > > Another approach would be to accept a crc failure if the size of the transaction was larger
>> > > than bdev_block_size, but to treat it as a fatal error for anything smaller than that.
>> > >
>> > > In addition, I strongly believe that with the new OP_FILE_UPDATE_INC we will almost never have
>> > > more than 4K in a single log transaction.
>> > >
>> > > Best regards,
>> > > Adam
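A minimal sketch of the check Greg describes above, assuming BlueFS keeps (or could keep) a persisted record of the highest transaction sequence number known to be committed; the names used here (classify_replay_stop, known_committed_seq, and so on) are illustrative, not the actual BlueFS API:

  #include <cstdint>

  enum class ReplayStop { clean_end, corruption };

  // last_replayed_seq:   seq of the last transaction that decoded and
  //                      replayed successfully before the failure.
  // known_committed_seq: highest seq recorded as durable (e.g. in the
  //                      superblock) before the shutdown.
  ReplayStop classify_replay_stop(uint64_t last_replayed_seq,
                                  uint64_t known_committed_seq)
  {
    if (last_replayed_seq < known_committed_seq) {
      // The log ends before a transaction we had already acknowledged
      // as stable: the disk has lost data, so replay must fail.
      return ReplayStop::corruption;
    }
    // The failing transaction was never acknowledged as stable; a torn
    // tail after a poweroff is expected, so stop replay cleanly here.
    return ReplayStop::clean_end;
  }

Anything newer than known_committed_seq is then the unavoidable window Sam refers to: writes that may or may not have reached the disk but were never acknowledged as stable.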
>> > >
>> > > On Mon, 16 Aug 2021 at 13:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>> > >>
>> > >> Hi folks,
>> > >>
>> > >> the preface:
>> > >>
>> > >> This email has been inspired by the following ticket:
>> > >> https://tracker.ceph.com/issues/52079
>> > >>
>> > >> which reports two apparently similar bluefs corruptions caused by
>> > >> unexpected power off.
>> > >>
>> > >> In both cases BlueFS finally refuses to replay the resulting log
>> > >> with pretty much the same pattern:
>> > >>
>> > >> - read the next 4K block (op header),
>> > >>
>> > >> - then do more block reads (an additional 8K, apparently due to the bluefs
>> > >> op/transaction being longer than 4K),
>> > >>
>> > >> - which finally causes faulty replay termination due to a transaction (not
>> > >> op header!) checksum failure (both times the 'expected checksum', i.e. the
>> > >> one read from disk in the last(!) block, is equal to 0).
>> > >>
>> > >> Here is the sample:
>> > >>
>> > >> "debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
>> > >> 0x559f918f7f80 0xb5000~1000 from file(ino 1 size 0xb5000 mtime 0.000000
>> > >> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> > >>
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
>> > >> 0x5b000 len 0x1000
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 4096
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _replay
>> > >> need 0x2000 more bytes
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
>> > >> 0x559f918f7f80 0xb6000~2000 from file(ino 1 size 0xb5000 mtime 0.000000
>> > >> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
>> > >> 0x5a000 len 0x2000
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 8192
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs _replay
>> > >> 0xb5000: stop: failed to decode: bad crc 1492738775 expected 0:
>> > >> Malformed input
>> > >>
>> > >> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs mount failed
>> > >> to replay log: (5) Input/output error
>> > >>
>> > >> "
>> > >>
>> > >> The above makes me think that an incomplete multi-block transaction
>> > >> write happened during the power off. This state should be fine to recover
>> > >> from IMO, but it looks like we don't have proper handling in log replay for
>> > >> this case. Shouldn't we treat a transaction decoding failure (multi-block
>> > >> ones only?) as a good stop condition, not as the unrecoverable one we have
>> > >> for now:
>> > >>
>> > >> bluefs_transaction_t t;
>> > >> try {
>> > >>   auto p = bl.cbegin();
>> > >>   decode(t, p);
>> > >> }
>> > >> catch (ceph::buffer::error& e) {
>> > >>   derr << __func__ << " 0x" << std::hex << pos << std::dec
>> > >>        << ": stop: failed to decode: " << e.what()
>> > >>        << dendl;
>> > >>   delete log_reader;
>> > >>   return -EIO;
>> > >> }
>> > >>
>> > >> What do you think? Is the above analysis valid?
>> > >>
>> > >> Any other thoughts?
>> > >>
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Igor
>> > >>
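For illustration only, the catch block quoted above could distinguish the two cases along the lines Igor and Adam suggest (treat a decode failure as a clean end of log only for a multi-block transaction). This is a sketch, not a tested patch: it assumes the surrounding replay loop, and that a variable such as 'more' holds the number of bytes that had to be read beyond the first 4K block, as in the "need 0x2000 more bytes" line in the log above; the real code may need a different condition or extra cleanup.

  bluefs_transaction_t t;
  try {
    auto p = bl.cbegin();
    decode(t, p);
  }
  catch (ceph::buffer::error& e) {
    if (more > 0) {
      // Multi-block transaction: a torn write at the tail of the log
      // after a poweroff can produce exactly this decode/crc failure,
      // so treat it as the end of the log rather than as corruption.
      dout(10) << __func__ << " 0x" << std::hex << pos << std::dec
               << ": stop: failed to decode multi-block transaction,"
               << " assuming clean end of log" << dendl;
      break;  // leave the replay loop; sketch only, real cleanup may differ
    }
    // A transaction that fits in a single bdev block is written
    // atomically, so a decode failure there cannot be a torn write:
    // keep it fatal.
    derr << __func__ << " 0x" << std::hex << pos << std::dec
         << ": stop: failed to decode: " << e.what()
         << dendl;
    delete log_reader;
    return -EIO;
  }

Combined with a known-committed sequence check like the one sketched earlier in the thread, this would only ever skip transactions that were never acknowledged as stable.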
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx