Re: bluefs transaction serialization atomicity.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I think that any outcome possible under normal operation (abnormal
poweroff is normal) should not prevent the OSD from booting.
-Sam

On Wed, Aug 18, 2021 at 11:57 PM Adam Kupczyk <akupczyk@xxxxxxxxxx> wrote:
>
> Hi,
>
> I think in our policy of writing the log, we have no way to distinguish between legit end-of-log
> caused by abnormal poweroff and by data corruption.
> With exception of being requested to read more data but having no more extents.
>
> My thinking would be that such error should be fatal for OSD reboot,
> but acceptable for `bluestore-tool fsck`.
> Another approach would be to accept crc failure if size of transaction was larger
> then bdev_block_size, but for smaller then that treat as fatal errors.
>
> In addition I strongly believe that with new OP_FILE_UPDATE_INC we will almost never have
> more then 4K of single log transaction.
>
> Best regards,
> Adam
>
> On Mon, 16 Aug 2021 at 13:29, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>
>> Hi folks,
>>
>> the preface:
>>
>> This email has been inspired by the following ticket:
>> https://tracker.ceph.com/issues/52079
>>
>> which reports about two apparently similar bluefs corruptions caused by
>> unexpected power off.
>>
>> In the both cases BlueFS finally refuses to replay the resulting log
>> with pretty the same pattern:
>>
>> - read the next 4K block (op header),
>>
>> - then do more block readings (additional 8K, apparently due to bluefs
>> op/transaction being longer than 4K)
>>
>> - which finally causes faulty replay termination due to transaction (not
>> op header!) checksum failure (both times 'expected checksum', i.e. the
>> one read from disk in the last(!) block, is equal to 0).
>>
>> Here is the sample:
>>
>> "debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
>> 0x559f918f7f80 0xb5000~1000 from file(ino 1 size 0xb5000 mtime 0.000000
>> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>>
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
>> 0x5b000 len 0x1000
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got
>> 4096debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _replay
>> need 0x2000 more bytes
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 10 bluefs _read h
>> 0x559f918f7f80 0xb6000~2000 from file(ino 1 size 0xb5000 mtime 0.000000
>> allocated 410000 extents [1:0xb558660000~10000,1:0xb652780000~400000])
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read left
>> 0x5a000 len 0x2000
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 20 bluefs _read got 8192
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs _replay
>> 0xb5000: stop: failed to decode: bad crc 1492738775 expected 0:
>> Malformed input
>>
>> debug 2021-08-14T05:35:38.827+0000 7f71b2b39080 -1 bluefs mount failed
>> to replay log: (5) Input/output error
>>
>> "
>>
>> The above makes me think about the incomplete multi-block transaction
>> writing happened during power off. This state should be fine for recover
>> IMO but it looks like we don't have proper handling in log replay for
>> this case. Shouldn't we treat transaction decoding failure (multi-block
>> one only?) as a good stop condition, not as unrecoverable one we have
>> for now:
>>
>>      bluefs_transaction_t t;
>>      try {
>>        auto p = bl.cbegin();
>>        decode(t, p);
>>      }
>>      catch (ceph::buffer::error& e) {
>>        derr << __func__ << " 0x" << std::hex << pos << std::dec
>>             << ": stop: failed to decode: " << e.what()
>>             << dendl;
>>        delete log_reader;
>>        return -EIO;
>>      }
>>
>> What do you think? Is the above analysis valid?
>>
>> Any other thoughts?
>>
>>
>> Thanks,
>>
>> Igor
>>
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



[Index of Archives]     [CEPH Users]     [Ceph Devel]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux