Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Hi Alexander,

thank you for your suggestion! All my nodes have ECC memory. However, I have now verified that it is recognized correctly on every system (`dmesg | grep EDAC`). Furthermore, I checked whether any errors had occurred, both with `edac-util` and by searching the logs of the mainboard BMCs. Everything looks perfectly fine. So I think we can now be sure that it is not a memory issue.
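
For reference, these are roughly the commands I ran on each node (the exact flags are from memory, and the output of course differs per board and distro):

```
# confirm the EDAC driver detected the memory controller(s)
dmesg | grep -i EDAC
# summarize corrected/uncorrected error counts per memory controller / DIMM
edac-util --report=full
# quick check that the EDAC modules are loaded at all
edac-util --status
```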

Thanks for reminding me to check memory!


Best regards,
Sebastian

 

> On 23.02.2022, at 02:03, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> 
> I have another suggestion: check the RAM, just in case, with memtest86
> or https://github.com/martinwhitaker/pcmemtest (which is a fork of
> memtest86+). Ignore the suggestion if you have ECC RAM.
> 
> Tue, 22 Feb 2022 at 15:45, Igor Fedotov <igor.fedotov@xxxxxxxx>:
>> 
>> Hi Sebastian,
>> 
>> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>>> Hey Igor!
>>> 
>>> 
>>>> thanks a lot for the new logs - looks like they provide some insight.
>>> I'm glad the logs are helpful.
>>> 
>>> 
>>>> At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>> 
>>> Do I understand that correctly: the corruption of the RocksDB (table overwritten by zeros) happens at the first start of the OSD after “*** Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system launches the OSD service, the RocksDB is still fine?
>> Looks like that. From the logs I can see an unexpected write to a specific
>> extent (LBA 0x630000) which shouldn't occur and at which RocksDB
>> subsequently fails.
>>> 
>>> 
>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
>>> To the best of my knowledge the OSD was hardly doing anything, and I do not see any pattern that would fit your explanation.
>>> However, you certainly understand a lot more about it than I do, so I try to explain everything that could be relevant.
>>> 
>>> The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD and the other two are on 12TB HDDs.
>>> 
>>> I have configured two crush rules, ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ rule is a replicated rule that uses only OSDs with class ‘nvme’. The second rule is a tricky erasure rule: it selects exactly 2 OSDs on exactly 4 hosts with class ‘hdd’, so it only works for a size of exactly 8. That means a pool that uses this rule always has only “undersized” placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)
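>>> 
>>> To illustrate, the decompiled erasure rule looks roughly like this (the rule id and the exact tunable steps are written down from memory, so please treat it as a sketch rather than the exact rule):
>>> 
>>> ```
>>> rule ec4x2hdd {
>>>     id 2
>>>     type erasure
>>>     step set_chooseleaf_tries 5
>>>     step set_choose_tries 100
>>>     step take default class hdd
>>>     step choose indep 4 type host
>>>     step chooseleaf indep 2 type osd
>>>     step emit
>>> }
>>> ```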
>>> 
>>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule ‘ec4x2hdd’ with k=5, m=3 and a min size of 6.
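>>> 
>>> If it helps, the pools were created roughly like the following (PG counts and the profile name are from memory, so this is only a sketch of the setup, not a copy of my shell history):
>>> 
>>> ```
>>> # erasure-coded pools on the HDDs
>>> ceph osd erasure-code-profile set ec4x2hdd-profile k=5 m=3 crush-device-class=hdd
>>> ceph osd pool create fs.data-hdd.ec-pool 128 128 erasure ec4x2hdd-profile ec4x2hdd
>>> ceph osd pool set fs.data-hdd.ec-pool min_size 6
>>> ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true   # needed for CephFS/RBD on EC pools
>>> # replicated pools on the NVMe SSDs
>>> ceph osd pool create fs.data-root-pool 64 64 replicated c3nvme
>>> ceph osd pool set fs.data-root-pool size 3
>>> ceph osd pool set fs.data-root-pool min_size 2
>>> ```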
>>> 
>>> The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used for rados bench a few months ago.
>>> 
>>> The pool fs.metadata-root-pool is used as metadata pool for cephFS and fs.data-root-pool as the root data pool for the cephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the cephFS and is specified as ceph.dir.layout for some folders of the cephFS. The whole cephFS is mounted by each of the 3 nodes.
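>>> 
>>> The layout is set in the usual way via the directory vxattr, roughly like this (the directory path here is just a placeholder):
>>> 
>>> ```
>>> setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool /mnt/cephfs/some-folder
>>> getfattr -n ceph.dir.layout /mnt/cephfs/some-folder   # verify the layout
>>> ```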
>>> 
>>> The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs. The first VM runs Ubuntu Desktop and the second Debian as OS. The pool block-hdd.ec-pool hosts one RBD image (the data part; metadata on block-nvme.c-pool) that is attached to the Debian VM as a second drive formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory of the CephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs were doing nothing, except being booted, in the last couple of days.
>>> 
>>> I try to illustrate the pool usage as a tree:
>>> * c3nvme (replicated, size=3, min_size=2)
>>>     + device_health_metrics
>>>     + test-pool
>>>         - rados bench
>>>     + fs.metadata-root-pool
>>>         - CephFS (metadata)
>>>     + fs.data-root-pool
>>>         - CephFS (data root)
>>>     + fs.data-nvme.c-pool
>>>     + block-nvme.c-pool
>>>         - RBD (Ubuntu VM, boot disk with ext4)
>>>         - RBD (Debian VM, boot disk with ext4)
>>> * ec4x2hdd (ec,  k=5, m=3, min_size=6)
>>>     + fs.data-hdd.ec-pool
>>>         - CephFS (layout data)
>>>     + block-hdd.ec-pool
>>>         - RBD (Debian VM, disk 2 with BTRFS)
>>> 
>>> 
>>> Last week I wrote about 30TB to the CephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contains hundreds of thousands of hard links. Nearly the same data was already on the storage and I used rsync to update it. I think something between 1 and 4 TB had changed and was updated by rsync.
>>> 
>>> I think the cluster was totally unused on Friday, but up and running and idling. Then on Saturday I did a graceful shutdown of all cluster nodes. Around 5 minutes later, when I booted the servers again, OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed OSD.7 again. When I checked for the first time today at around 12:00, Ceph had already finished backfilling to OSD.7 and the cluster was idle again.
>>> 
>>> I then spent 70 minutes writing 3 small files (one of about 500 bytes and two of about 2KB) to the CephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again, I overwrote the 3 small files (some bytes changed) and shut down all 3 nodes. I repeated that procedure 4 times within 70 minutes. Then OSD.7 crashed again.
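>>> 
>>> In shell terms, each iteration was roughly the following (file names and the mount path are just placeholders):
>>> 
>>> ```
>>> for f in a.txt b.txt c.txt; do
>>>     # change a few bytes in each of the 3 small files in the EC-backed directory
>>>     printf 'iteration %s\n' "$(date +%s)" >> /mnt/cephfs/ec-folder/"$f"
>>> done
>>> # then: graceful shutdown of all 3 nodes, boot them again, repeat
>>> ```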
>>> 
>>> There may have been some more IO caused by mounting the RBD image with the BTRFS volume inside the Debian VM, but I do not know. I can only say that no user was using it. Only I was writing the 3 small files 4 times to the CephFS on the fs.data-hdd.ec-pool.
>>> 
>>> Does this make any sense to you?
>> Well, that's actually not a write pattern I would expect to cause the
>> issue. But thanks a lot anyway. So could you please repeat that
>> procedure (overwriting the 3 small files and restarting the OSD) and
>> collect the logs with debug-bluestore set to 20?
>>> 
>>>> with debug-bluestore set to 20
>>> Do you mean
>>> ```
>>> [osd]
>>>      debug bluestore = 20
>>> ```
>>> in ceph.conf?
>> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
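>> For example, something along these lines (adjust it to your setup, and revert it afterwards so the log does not grow too much):
>> 
>> ```
>> ceph config set osd.7 debug_bluestore 20/20    # raise both the log and gather level
>> # reproduce the issue, restart the OSD, then collect the OSD log
>> # (typically /var/log/ceph/ceph-osd.7.log)
>> ceph config rm osd.7 debug_bluestore           # back to the default level
>> ```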
>> 
>>> Thanks,
>>> Sebastian
>>> 
>>> 
>>> 
>>> 
>>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>>> 
>>>> Hey Sebastian,
>>>> 
>>>> thanks a lot for the new logs - looks like they provide some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>>> 
>>>> And now I'd like to determine what's wrong with this deferred write replay.
>>>> 
>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Igor
>>>> 
>>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>>> Hi Igor,
>>>>> 
>>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed OSD 7 again. And this time I have logs with “debug bluefs = 20” and “debug bdev = 20” for every OSD in the cluster! It was the OSD with ID 7 again, so this HDD has now failed for the third time! Coincidence? Probably not…
>>>>> The important thing seems to be that a shutdown, and not only a restart, of the entire cluster is performed. This time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>>> 
>>>>> I redeployed OSD.7 after the crash 2 days ago, and I started this new shutdown-and-boot series earlier today, shortly after Ceph had finished writing everything back to OSD.7.
>>>>> 
>>>>> The corrupted RocksDB file (from the crash) is again only 2KB in size.
>>>>> You can download the RocksDB file with the bad table magic number and the log of OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>>> What else do you need?
>>>>> 
>>>>> From the log of the OSD.7:
>>>>> —————
>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 (ceph:ceph)
>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty --pid-file
>>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>>> —————
>>>>> 
>>>>> To me this looks like the OSD did nothing for nearly 2 minutes before it received the termination request. Shouldn't that be enough time to flush every imaginable write cache?
>>>>> 
>>>>> 
>>>>> I hope this helps you.
>>>>> 
>>>>> 
>>>>> Best wishes,
>>>>> Sebastian
>>>>> 
>>>> --
>>>> Igor Fedotov
>>>> Ceph Lead Developer
>>>> 
>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>> 
>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>> Com. register: Amtsgericht Munich HRB 231263
>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>> 
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>> 
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
> 
> 
> -- 
> Alexander E. Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



