Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

Hi!

Unfortunately not - I've done some digging but haven't found a cause or a solution yet.

Igor, what's your suggestion on how we should look for a solution?

-- Jonas


On 27/04/2021 09.47, Dan van der Ster wrote:
> Hi,
> 
> Just pinging to check if this issue was understood yet?
> 
> Cheers, Dan
> 
> On Mon, Apr 12, 2021 at 9:12 PM Jonas Jelten <jelten@xxxxxxxxx> wrote:
>>
>> Hi Igor!
>>
>> I have plenty of OSDs to lose, as long as the recovery works well afterward, so I can go ahead with it :D
>>
>> What debug flags should I activate? osd=10, bluefs=20, bluestore=20, rocksdb=10, ...?
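>>
>> (For reference, I'd bump them via the central config before starting the upgraded OSD, roughly like below - osd.NN is a placeholder and the exact levels are what I'm asking about:)
>>
>>   ceph config set osd.NN debug_osd 10/10
>>   ceph config set osd.NN debug_bluefs 20/20
>>   ceph config set osd.NN debug_bluestore 20/20
>>   ceph config set osd.NN debug_rocksdb 10/10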
>>
>> I'm not sure it's really the transaction size, since the broken WriteBatch is dumped and it's the command index (the WriteBatch tag) that is out of range.
>> I don't see why the transaction size would result in such a corruption - my naive reading of the rocksdb sources suggests that 14851 repairs shouldn't overflow the 32-bit WriteBatch entry counter, but who knows.
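>>
>> (For context, this is the WriteBatch layout as I understand it from rocksdb's write_batch.cc - a rough Python sketch, so the details are my reading and not authoritative:)
>>
>>   import struct
>>
>>   def parse_writebatch_header(buf):
>>       # 12-byte header: little-endian 8-byte sequence number followed by
>>       # the little-endian 32-bit record count mentioned above
>>       seq, count = struct.unpack_from("<QI", buf, 0)
>>       return seq, count
>>
>>   def first_record_tag(buf):
>>       # each record after the header starts with a one-byte type tag
>>       # (0x00 delete, 0x01 put, 0x02 merge, ...); an unrecognized byte
>>       # here is what triggers "unknown WriteBatch tag"
>>       return buf[12]
>>
>>   # synthetic, well-formed batch with a single Put("k", "v") record
>>   demo = struct.pack("<QI", 1, 1) + bytes([0x01, 1]) + b"k" + bytes([1]) + b"v"
>>   print(parse_writebatch_header(demo), hex(first_record_tag(demo)))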
>>
>> Are rocksdb keys like this normal? If so, what's the construction logic? The pool is called 'dumpsite'.
>>
>> 0x80800000000000000a194027'Rdumpsite!rbd_data.6.28423ad8f48ca1.0000000001b366ff!='0xfffffffffffffffeffffffffffffffff'o'
>> 0x80800000000000000a1940f69264756d'psite!rbd_data.6.28423ad8f48ca1.00000000011bdd0c!='0xfffffffffffffffeffffffffffffffff'o'
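>>
>> (My possibly-wrong reading of get_object_key() in BlueStore.cc is that the binary prefix of such a key decodes roughly like below - a Python sketch where the field layout is my assumption:)
>>
>>   import struct
>>
>>   # hex prefix of the first key above: shard byte + pool id + object hash
>>   prefix = bytes.fromhex("80800000000000000a19402752")
>>
>>   shard = prefix[0]                            # 0x80 seems to mean "no shard"
>>   pool, = struct.unpack(">Q", prefix[1:9])     # pool id, offset by 2^63
>>   hash_, = struct.unpack(">I", prefix[9:13])   # object hash (bit-reversed?)
>>
>>   print("shard:", "none" if shard == 0x80 else shard - 0x80)
>>   print("pool :", pool - 0x8000000000000000)   # -> 10 here
>>   print("hash :", hex(hash_))
>>
>>   # the remainder is printable in the dump: escaped namespace/name terminated
>>   # by '!', a '<'/'='/'>' byte, 8-byte snapid, 8-byte generation and an 'o'
>>   # suffix - at least that's my guess, corrections welcome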
>>
>>
>> -- Jonas
>>
>>
>>
>>
>>
>> On 12/04/2021 16.54, Igor Fedotov wrote:
>>> Sorry for being so late to the party...
>>>
>>> I think the root cause is related to the high number of repairs performed during the first post-upgrade fsck run.
>>>
>>> The check (and fix) for zombie spanning blobs was backported to v15.2.9 (here is the PR: https://github.com/ceph/ceph/pull/39256), and I presume it's the one causing the BlueFS data corruption, due to the huge transaction that happens during such a repair.
>>>
>>> I haven't seen this exact issue (having that many zombie blobs is a rare bug in itself), but we had a somewhat similar issue with upgrading omap names, see: https://github.com/ceph/ceph/pull/39377
>>>
>>> The resulting huge transaction could cause an overly large write to the WAL, which in turn caused data corruption (see https://github.com/ceph/ceph/pull/39701).
>>>
>>> Although the fix for the latter has been merged for 15.2.10, some additional issues with huge transactions might still exist...
>>>
>>>
>>> If someone can afford to lose another OSD, it would be interesting to get an OSD log for such a repair with debug-bluefs set to 20...
>>>
>>> I'm planning to make a fix to cap the transaction size for repair in the near future anyway, though...
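>>>
>>> Conceptually something along these lines - not actual Ceph code, just a Python sketch of the idea of flushing whenever the pending batch grows past a cap (the names and the limit are made up):
>>>
>>>   MAX_TXN_BYTES = 16 * 1024 * 1024   # illustrative cap, not the real value
>>>
>>>   def repair_with_capped_txns(db, fixes):
>>>       # db: stand-in object exposing write_batch(dict), not a real
>>>       # RocksDB/KeyValueDB binding; fixes: iterable of (key, value) pairs
>>>       batch, batch_bytes = {}, 0
>>>       for key, value in fixes:
>>>           batch[key] = value
>>>           batch_bytes += len(key) + len(value)
>>>           if batch_bytes >= MAX_TXN_BYTES:
>>>               db.write_batch(batch)       # commit this bounded chunk
>>>               batch, batch_bytes = {}, 0
>>>       if batch:
>>>           db.write_batch(batch)           # commit the final partial chunk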
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>


