Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

Igor Fedotov <igor.fedotov@xxxxxxxx> · Mon, 21 Feb 2022 23:19:08 +0300

Hey Sebastian,

thanks a lot for the new logs - looks like they provides some insight. 
At this point I think the root cause is apparently a race between 
deferred writes replay and some DB maintenance task happening on OSD 
startup. It seems that deferred write replay updates a block extent 
which RocksDB/BlueFS are using. Hence the target BlueFS file gets 
all-zeros content. Evidently that's just a matter of chance whether they 
use conflicting physical extent or not hence the occasional nature of 
the issue...

And now I'd like to determine what's wrong with this deferred write replay.

So first of all I'm curious if you have any particular write patterns 
that can be culprits? E.g. something like disk wiping procedure which 
writes all-zeros to an object followed by object truncate or removal 
comes to my mind. If you can identify something like that - could you 
please collect OSD log for such an operation (followed by OSD restart) 
with debug-bluestore set to 20?

Thanks,

Igor

On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
Hi Igor,

today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I have logs with “debug bluefs = 20” and "debug bdev = 20” for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has failed now the third time! Coincidence? Probably not…
The important thing seams to be that a shutdown and not only a restart of the entire cluster is performed. Since, this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.

I redeployed the OSD.7 after the crash from 2 days ago. And I started this new shutdown and boot series shortly after ceph had finished writing everything back to OSD.7, earlier today.

The corrupted RocksDB file (crash) is again only 2KB in size.
You can download the RocksDB file with the bad  table magic number and the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
What else do you want?

 From the log of the OSD.7:
—————
2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty --pid-file
2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
—————

For me this looks like that the OSD did nothing for nearly 2 minutes before it receives the termination request. Shouldn't this be enough time for flushing every imaginable write cache?

I hope this helps you.

Best wishes,
Sebastian

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx